Abstract — While random linear network coding is a powerful 
tool for disseminating information in communication networks, 
it is highly susceptible to errors caused by various sources. Due 
to error propagation, errors greatly deteriorate the throughput 
of network coding and seriously undermine both reliability and 
security of data. Hence error control for network coding is vital. 
Recently, constant-dimension codes (CDCs), especially Kötter-
Kschischang (KK) codes, have been proposed for error control in 
random linear network coding. KK codes can also be constructed 
from Gabidulin codes, an important class of rank metric codes. 
Rank metric decoders have been recently proposed for both 
Gabidulin and KK codes, but they have high computational 
complexities. Furthermore, it is not clear whether such decoders 
are feasible and suitable for hardware implementations. In this 
paper, we reduce the complexities of rank metric decoders and 
propose novel decoder architectures for both codes. The synthesis 
results of our decoder architectures for Gabidulin and KK codes 
with limited error-correcting capabilities over small fields show 
that our architectures not only are affordable, but also achieve 
high throughput. 

Index Terms — Constant-dimension codes (CDCs), decoding, error
correction coding, Gabidulin codes, Galois fields, integrated
circuits, Kötter-Kschischang codes, network coding, rank metric
codes, subspace codes.



I. Introduction 

Network coding [1] is a promising candidate for a new
unifying design paradigm for communication networks [2].
The basic idea of network coding is simple yet powerful.
Traditionally, an intermediate node in a network simply stores and
forwards each packet it receives. Instead, in network coding,
an intermediate node combines several incoming packets into
one or several outgoing packets and forwards them. This simple
change has a significant ramification: the capacity of a
multicast (the minimum cut between the source node and any
destination node) can be achieved by network coding, but not
by traditional networks without network coding [1]. In addition
to the throughput improvement, network coding is robust to
nonergodic network failures and ergodic packet erasures, can
be done in a distributed manner with low complexity, and
improves mobility and management of wireless networks.
Furthermore, implementations of network coding in both wired
and wireless environments demonstrate its practical benefits
(see, for example, [3], [4]). Due to these advantages, network
coding is already used or considered in gossip-based data
dissemination [5], 802.11 wireless ad hoc networking [4],
peer-to-peer networks [3], and mobile ad hoc networks (MANETs)
[6].

Random linear network coding (RLNC) [7] is arguably the
most important class of network coding. Instead of using
network coding operations centrally designed to achieve the
maximum throughput based on the network topology, RLNC treats
all packets as vectors over some finite field and forms an
outgoing packet by linearly combining incoming packets
using random coefficients. Due to its random linear operations,
RLNC not only achieves network capacity in a distributed
manner, but also provides robustness to changing network
conditions. Unfortunately, it is highly susceptible to errors
from various sources, such as noise, malicious or malfunctioning
nodes, or insufficient min-cut [8]. Since linearly combining
packets results in error propagation, errors greatly deteriorate
the throughput of network coding and seriously undermine
both reliability and security of data. Thus, error control for
random linear network coding is critical and has received
growing attention recently (see, for example, [8], [9]).

Error control schemes proposed for RLNC assume two types 
of transmission models. The schemes of the first type depend 
on and take advantage of the underlying network topology or 
the particular linear network coding operations performed at 
various network nodes. The schemes of the second type [8],
[9] assume that the transmitter and receiver have no knowledge
of such channel transfer characteristics. The two transmission 
models are referred to as coherent and noncoherent network 
coding, respectively. 

It has been recently shown [8] that an error control code for
noncoherent network coding, called a subspace code, is a set
of subspaces (of a vector space), and information is encoded
in the choice of a subspace as a codeword; a set of packets
that generate the chosen subspace is then transmitted [8]. In
contrast, a code in classical coding theory is a set of vectors,
and information is encoded in the choice of a vector as a
codeword. A subspace code is called a constant-dimension code
(CDC) if its subspaces are of the same dimension. CDCs
are of particular interest since they lead to simplified network
protocols due to the fixed dimension. A class of asymptotically
optimal CDCs has been proposed in [8], and they are
referred to as the KK codes. A decoding algorithm based
on interpolation for bivariate linearized polynomials is also
proposed in [8] for the KK codes. It was shown that KK codes
correspond to lifting [9] of Gabidulin codes [10], [11], a class
of optimal rank metric codes. Gabidulin codes are also called
maximum rank distance (MRD) codes, since they achieve the
Singleton bound in the rank metric [10], as Reed-Solomon (RS)
codes achieve the Singleton bound of Hamming distance. Gabidulin
codes have found applications in storage systems [11], space-time
coding [12] as well as cryptography [13], and various decoding
algorithms have been proposed for Gabidulin codes [14]-[18].
Due to the connection between Gabidulin and KK codes, the
decoding of KK codes can be viewed as generalized decoding
of Gabidulin codes, which involves deviations as well as errors
and erasures [9]. Gabidulin codes are significant in themselves.
For coherent network coding, the error correction capability
of error control schemes is succinctly described by the rank
metric [19]; thus error control codes for coherent network
coding are essentially rank metric codes.

The benefits of network coding above come at the price of 
additional operations needed at the source nodes for encoding, 
at the intermediate nodes for linear combining, and at the 
destination node(s) for decoding. In practice, the decoding 
complexities at destination nodes are much greater than the 
encoding and combining complexities. The decoding complex- 
ities of RLNC are particularly high when large underlying 
fields are assumed and when additional mechanisms such as 
error control are accounted for. Clearly, the decoding com- 
plexities of RLNC are critical to both software and hardware 
implementations. Furthermore, area/power overheads of their 
VLSI implementations are important factors in system design. 
Unfortunately, prior research efforts have mostly focused on 
theoretical aspects of network coding, and complexity reduc- 
tion and efficient VLSI implementation of network coding 
decoders have not been sufficiently investigated so far. For
example, although the decoding complexities of Gabidulin and
KK codes were analyzed in [20], [21] and [8], [9], respectively,
these analyses do not reflect the impact of the size of the
underlying finite fields. To ensure high probability of success
for RLNC, a field of size $2^8$ or $2^{16}$ is desired [22].
However, these large field sizes will increase decoding
complexities and hence complicate hardware implementations.
Finally, to the best of our
knowledge, hardware architectures for these decoders have not 
been investigated in the open literature. 

In this paper, we fill this significant gap by investigating 
complexity reductions and efficient hardware implementations 
for decoders in RLNC with error control. This effort is signifi- 
cant to the evaluation and design of network coding for several 
reasons. First, our results evaluate the complexities of decoders 
for RLNC as well as the area, power, and throughput of their 
hardware implementations, thereby helping to determine the 
feasibility and suitability of network coding for various
applications. Second, our research results provide instrumental
guidelines to the design of network coding from the perspective
of complexity as well as hardware implementation. Third, our
research results lead to efficient decoders and hence reduce the
area and power overheads of network coding.

In this paper, we focus on the generalized Gabidulin decoding
algorithm [9] for the KK codes and the decoding algorithm
in [16] for Gabidulin codes for two reasons. First, compared
with the decoding algorithm in [8], the generalized Gabidulin
decoding [9] has a smaller complexity, especially for
high-rate KK codes [9]. Second, components in the errors-only
Gabidulin decoding algorithm in [16] can be easily adapted in
the generalized Gabidulin decoding of KK codes. Thus, among
the decoding algorithms for Gabidulin codes, we focus on
the decoding algorithm in [16]. Finally, although we focus on
random linear network coding with error control in this paper,
our results can be easily applied to random linear network
coding without error control. For random linear network coding
without error control, the decoding complexity is primarily
due to inverting the global coding matrix via Gauss-Jordan
elimination, which is considered in this paper.
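
As a rough illustration of the Gauss-Jordan step mentioned above, the following Python sketch forms coded packets with random coefficients and recovers the source packets by reducing the received matrix. The function names, the choice $q = 2$, and the list-of-bits packet format are assumptions made for this sketch only; it is not the decoder architecture studied in this paper.

```python
import random

def rlnc_encode(packets, num_coded, rng):
    """Form coded packets as random GF(2) combinations of the source packets.

    Each coded packet is the global coding vector followed by the payload.
    """
    n = len(packets)
    coded = []
    for _ in range(num_coded):
        coeffs = [rng.randint(0, 1) for _ in range(n)]
        payload = [0] * len(packets[0])
        for c, p in zip(coeffs, packets):
            if c:
                payload = [a ^ b for a, b in zip(payload, p)]
        coded.append(coeffs + payload)
    return coded

def gauss_jordan_decode(received, n):
    """Reduce [A | y] over GF(2); return the n source packets, or None if rank A < n."""
    rows = [r[:] for r in received]
    pivot_row = 0
    for col in range(n):
        sel = next((i for i in range(pivot_row, len(rows)) if rows[i][col]), None)
        if sel is None:
            return None  # the global coding matrix is not invertible
        rows[pivot_row], rows[sel] = rows[sel], rows[pivot_row]
        for i in range(len(rows)):
            if i != pivot_row and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[pivot_row])]
        pivot_row += 1
    # The first n rows now read [I | x], so the payload parts are the source packets.
    return [r[n:] for r in rows[:n]]
```

In this sketch a receiver simply collects more coded packets and retries whenever `gauss_jordan_decode` returns `None`.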

One of our main contributions consists of several algorithmic
reformulations that reduce the computational complexities of
decoders for both Gabidulin and KK codes. Our complexity-saving
algorithmic reformulations are:

• We first adopt normal basis representations for all finite
field elements, and then significantly reduce the complexity
of bit-parallel normal basis multipliers by using our
common subexpression elimination (CSE) algorithm;

• The decoding algorithms of both Gabidulin and KK codes
involve solving key equations. Based on approaches in
[23], [24], we reformulate the Berlekamp-Massey algorithm
(BMA), and propose an inversionless BMA. Our
inversionless BMA does not require any finite field
inversions, and leads to reduced complexities as well as
efficient architectures;

• The decoding algorithm of KK codes requires that the 
input be arranged in a row reduced echelon (RRE) form. 
We define a more generalized form called n-RRE form, 
and show that it is sufficient if the input is in the n-RRE 
form. This change not only reduces the complexity of re- 
formulating the input, but also enables parallel processing 
of decoding KK codes based on Cartesian products. 

The other main contribution of this paper is efficient decoder 
architectures for both Gabidulin and KK codes. Aiming to 
reduce the area and to improve the regularity of our decoder 
architectures, we have also reformulated other steps in the 
decoding algorithm. To evaluate the performance of our de- 
coder architectures for Gabidulin and KK codes, we implement 
our decoder architecture for two rate-1/2 Gabidulin codes, an 
(8, 4) code and a (16, 8) code, and their corresponding KK 
codes. Our KK decoders can be used in network coding with 
various packet lengths by Cartesian product [9]. The synthesis 
results of our decoders show that our decoder architectures 
for Gabidulin and KK codes over small fields with limited 
error-correcting capabilities not only are affordable, but also 
achieve high throughput. Our decoder architectures and imple- 
mentation results are novel to the best of our knowledge. 

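The rest of the paper is organized as follows. After briefly
reviewing the background in Section II, we present our
complexity-saving algorithmic reformulations and efficient decoder
architectures in Sections III and IV, respectively. In Section V,
the proposed architectures are implemented in Verilog and
synthesized for area/performance evaluation. The conclusion
is given in Section VI.

II. Preliminaries

A. Notation

Let $q$ denote a power of a prime and $\mathbb{F}_{q^m}$ denote a finite
field of order $q^m$. We use $\mathbb{F}_q^{n \times m}$ to denote the set of
all $n \times m$ matrices over $\mathbb{F}_q$, and use $\mathbf{I}_n$ to denote
an $n \times n$ identity matrix. For a set
$\mathcal{U} \subseteq \{0, 1, \dots, n-1\}$, $\mathcal{U}^c$ denotes the complement
subset $\{0, 1, \dots, n-1\} \setminus \mathcal{U}$ and $\mathbf{I}_{\mathcal{U}}$
denotes the columns of $\mathbf{I}_n$ in $\mathcal{U}$. In this paper, all vectors
and matrices are in bold face.

The rank weight of a vector over $\mathbb{F}_{q^m}$ is defined as the
maximal number of its coordinates that are linearly independent
over the base field $\mathbb{F}_q$. The rank metric between two vectors is
the rank weight of their difference [25]. For a column vector
$\mathbf{X} \in \mathbb{F}_{q^m}^n$, we can expand each of its components into a row
vector over the base field $\mathbb{F}_q$. Such a row expansion leads to an
$n \times m$ matrix over $\mathbb{F}_q$. In this paper, we slightly abuse the
notation so that $\mathbf{X}$ can represent a vector in $\mathbb{F}_{q^m}^n$ or a
matrix in $\mathbb{F}_q^{n \times m}$, although the meaning is usually clear
given the context.

Given a matrix $\mathbf{X}$, its row space, rank, and reduced row
echelon (RRE) form [26] are denoted by $\langle \mathbf{X} \rangle$,
$\operatorname{rank} \mathbf{X}$, and $\operatorname{RRE}(\mathbf{X})$, respectively. For a
subspace $\langle \mathbf{X} \rangle$, its dimension is denoted by
$\dim \langle \mathbf{X} \rangle$, and
$\operatorname{rank} \mathbf{X} = \dim \langle \mathbf{X} \rangle$. The rank distance of
two vectors $\mathbf{X}$ and $\mathbf{Y}$ in $\mathbb{F}_{q^m}^n$ is defined as
$d_R(\mathbf{X}, \mathbf{Y}) = \operatorname{rank}(\mathbf{X} - \mathbf{Y})$. The subspace
distance [8] of their row spaces $\langle \mathbf{X} \rangle$,
$\langle \mathbf{Y} \rangle$ is defined as
$d_S(\langle \mathbf{X} \rangle, \langle \mathbf{Y} \rangle) = \dim \langle \mathbf{X} \rangle + \dim \langle \mathbf{Y} \rangle - 2 \dim(\langle \mathbf{X} \rangle \cap \langle \mathbf{Y} \rangle)$.

Let the received matrix be $\mathbf{Y} = [\mathbf{A} \mid \mathbf{y}]$, where
$\mathbf{A} \in \mathbb{F}_q^{N \times n}$ and $\mathbf{y} \in \mathbb{F}_q^{N \times m}$. Note
that we always assume the received matrix is full-rank [9]. The
row and column rank deficiencies of $\mathbf{A}$ are
$\delta = N - \operatorname{rank} \mathbf{A}$ and $\mu = n - \operatorname{rank} \mathbf{A}$,
respectively. Then the RRE form of $\mathbf{Y}$ is expanded into
\[
\bar{\mathbf{Y}} =
\begin{bmatrix} \mathbf{I}_{\mathcal{U}^c} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}_\delta \end{bmatrix}
\operatorname{RRE}(\mathbf{Y}) =
\begin{bmatrix} \mathbf{I}_n + \hat{\mathbf{L}} \mathbf{I}_{\mathcal{U}}^T & \mathbf{r} \\ \mathbf{0} & \hat{\mathbf{E}} \end{bmatrix},
\]
where $\mathcal{U}^c$ denotes the column positions of leading entries in
the first $n$ rows of $\operatorname{RRE}(\mathbf{Y})$. The tuple
$(\mathbf{r}, \hat{\mathbf{L}}, \hat{\mathbf{E}})$ is called a reduction of $\mathbf{Y}$ [9].

Polynomials over $\mathbb{F}_{q^m}$ with only non-zero terms with degrees
$q^i$ are called linearized polynomials [27], [28]. For convenience,
let $[i]$ denote $q^i$. In a linearized polynomial, the greatest
value of $i$ of non-zero terms is defined as its $q$-degree.
The symbolic product of two linearized polynomials $a(x)$ and
$b(x)$, denoted by $\otimes$ (that is, $a(x) \otimes b(x) = a(b(x))$), is also
a linearized polynomial. The $q$-reverse of a linearized polynomial
$f(x) = \sum_{i=0}^{p} f_i x^{[i]}$ is given by the polynomial
$g(x) = \sum_{i=0}^{p} g_i x^{[i]}$, where $g_i = f_{p-i}^{[i-p]}$ for
$i = 0, 1, \dots, p$ and $p$ is the $q$-degree of $f(x)$. For a set
$\alpha$ of field elements, we use $\operatorname{minpoly}(\alpha)$ to denote its
minimal linearized polynomial, which is the monic linearized
polynomial of least degree such that all the elements of $\alpha$
are its roots.

B. Gabidulin Codes and Their Decoding

A Gabidulin code [10] is a linear $(n, k)$ code over $\mathbb{F}_{q^m}$,
whose parity-check matrix has the form
\[
\mathbf{H} = \begin{bmatrix}
h_0^{[0]} & h_1^{[0]} & \cdots & h_{n-1}^{[0]} \\
h_0^{[1]} & h_1^{[1]} & \cdots & h_{n-1}^{[1]} \\
\vdots & \vdots & \ddots & \vdots \\
h_0^{[n-k-1]} & h_1^{[n-k-1]} & \cdots & h_{n-1}^{[n-k-1]}
\end{bmatrix} \tag{1}
\]
where $h_0, h_1, \dots, h_{n-1} \in \mathbb{F}_{q^m}$ are linearly independent over
$\mathbb{F}_q$. Since $\mathbb{F}_{q^m}$ is an $m$-dimensional vector space over
$\mathbb{F}_q$, it is necessary that $n \le m$. Conversely, a generator matrix
of this Gabidulin code is given by
\[
\mathbf{G} = \begin{bmatrix}
g_0^{[0]} & g_1^{[0]} & \cdots & g_{n-1}^{[0]} \\
g_0^{[1]} & g_1^{[1]} & \cdots & g_{n-1}^{[1]} \\
\vdots & \vdots & \ddots & \vdots \\
g_0^{[k-1]} & g_1^{[k-1]} & \cdots & g_{n-1}^{[k-1]}
\end{bmatrix} \tag{2}
\]
where $\sum_{i=0}^{n-1} g_i^{[k-1]} h_i^{[s]} = 0$ for $s = 0, 1, \dots, n-2$
(equivalently, $\mathbf{G}\mathbf{H}^T = \mathbf{0}$). The minimum rank distance
of a Gabidulin code is $d = n - k + 1$ and all Gabidulin codes are
MRD codes.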

The decoding process of Gabidulin codes includes five major
steps: syndrome computation, key equation solver, finding
the root space, finding the error locators by Gabidulin's
algorithm [10], and finding error locations. The data flow of
Gabidulin decoding is shown in Figure 1.

Most key equation solvers are based on a modified
Berlekamp-Massey algorithm (BMA) [16] or a modified Welch-Berlekamp
algorithm (WBA) [29]. Other decoding algorithms include [30],
[31]. The decoding algorithms in [31] are for cryptography
applications and have very high complexity. In this paper, we
focus on the modified BMA.

As in RS decoding, we can compute syndromes for Gabidulin
codes as $\mathbf{S} = (S_0, S_1, \dots, S_{d-2}) = \mathbf{H}\mathbf{r}$ for any
received vector $\mathbf{r}$. Then the syndrome polynomial
$S(x) = \sum_{j=0}^{d-2} S_j x^{[j]}$ can be used to solve the key equation
\[
\sigma(x) \otimes S(x) \equiv \omega(x) \bmod x^{[d-1]} \tag{3}
\]
for the error span polynomial $\sigma(x)$, using the BMA. Up to
$t = \lfloor (d-1)/2 \rfloor$ error values $E_j$'s can be obtained by finding
a basis $E_0, E_1, \dots$ for the root space of $\sigma(x)$ using the
methods in [32], [33]. Then we can find the error locators $X_j$'s
corresponding to $E_j$'s by solving a system of equations
\[
S_l = \sum_{j=0}^{\tau-1} X_j^{[l]} E_j, \quad l = 0, 1, \dots, d-2 \tag{4}
\]
where $\tau$ is the number of errors. Gabidulin's algorithm [10]
in Algorithm 1 can be used to solve (4). Finally, the error
locations $L_j$'s are obtained from $X_j$'s by solving
\[
X_j = \sum_{i=0}^{n-1} L_{j,i} h_i, \quad j = 0, 1, \dots, \tau-1. \tag{5}
\]

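Algorithm 1 (Gabidulin's Algorithm [10]).
Input: $S_0, S_1, \dots, S_{d-2}$ and $E_0, E_1, \dots, E_{\tau-1}$
Output: $X_0, X_1, \dots, X_{\tau-1}$

1.1 Compute $\tau \times \tau$ matrices $\mathbf{A}$ and $\mathbf{Q}$ as
\[
A_{i,j} = \begin{cases}
E_j, & i = 0 \\
0, & i \ne 0,\ j < i \\
A_{i-1,j} - \left(\frac{A_{i-1,j}}{A_{i-1,i-1}}\right)^{[-1]} A_{i-1,i-1}, & \text{otherwise}
\end{cases}
\]
\[
Q_{i,j} = \begin{cases}
S_j, & i = 0 \\
Q_{i-1,j} - \left(\frac{Q_{i-1,j+1}}{A_{i-1,i-1}}\right)^{[-1]} A_{i-1,i-1}, & \text{otherwise.}
\end{cases}
\]

1.2 Compute $X_i$'s recursively as $X_{\tau-1} = Q_{\tau-1,0}/A_{\tau-1,\tau-1}$
and $X_i = \left(Q_{i,0} - \sum_{j=i+1}^{\tau-1} A_{i,j} X_j\right)/A_{i,i}$, for
$i = \tau-2, \tau-3, \dots, 0$.

Fig. 1. Data flow of Gabidulin decoding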
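
To make the recursion in Algorithm 1 concrete, the following Python sketch solves the system (4) over a small field. The field $\mathbb{F}_{2^4}$ (defined here by $x^4 + x + 1$), the polynomial-basis integer encoding, and all function names are assumptions chosen for illustration; the architectures in this paper use normal bases instead.

```python
M, POLY = 4, 0b10011  # assumed field GF(2^4), defined by x^4 + x + 1

def gf_mul(a, b):
    """Multiply two field elements stored as integers in a polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def frob(a, i):
    """Return a^[i] = a^(2^i); the exponent i (possibly negative) is reduced mod M."""
    for _ in range(i % M):
        a = gf_mul(a, a)
    return a

def gf_inv(a):
    """Inverse as a^(2^M - 2) = a^2 * a^4 * ... * a^(2^(M-1))."""
    result, power = 1, a
    for _ in range(M - 1):
        power = gf_mul(power, power)
        result = gf_mul(result, power)
    return result

def gabidulin_solve(S, E):
    """Solve S_l = sum_j X_j^[l] E_j for X_0..X_{tau-1}; needs len(S) >= len(E)
    and E_j's that are linearly independent over GF(2)."""
    tau = len(E)
    A, Q = list(E), list(S)      # rows A_0 and Q_0
    saved = []                   # (A_i, Q_{i,0}) kept for back substitution
    for i in range(tau):
        saved.append((A, Q[0]))
        if i == tau - 1:
            break
        p, p_inv = A[i], gf_inv(A[i])
        # Step 1.1: eliminate X_i from the remaining equations.
        A = [a ^ gf_mul(p, frob(gf_mul(a, p_inv), -1)) for a in A]
        Q = [Q[j] ^ gf_mul(p, frob(gf_mul(Q[j + 1], p_inv), -1))
             for j in range(len(Q) - 1)]
    X = [0] * tau
    for i in reversed(range(tau)):   # Step 1.2: back substitution
        A_i, acc = saved[i]
        for j in range(i + 1, tau):
            acc ^= gf_mul(A_i[j], X[j])
        X[i] = gf_mul(acc, gf_inv(A_i[i]))
    return X
```

Subtraction is written as XOR because the assumed field has characteristic two.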



In total, the decoding complexity of Gabidulin codes is
roughly $O(n^2(1 - R))$ operations over $\mathbb{F}_{q^m}$ [20], where $R$
is the code rate, or $O(dm^3)$ operations over $\mathbb{F}_q$ [21]. Note that
all polynomials involved in the decoding process are linearized
polynomials.
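
Since every polynomial in the decoder is linearized, evaluation and the symbolic product are the two basic operations. The sketch below illustrates both for $q = 2$; the field $\mathbb{F}_{2^4}$ defined by $x^4 + x + 1$, the coefficient-list representation, and the function names are assumptions for illustration only.

```python
M, POLY = 4, 0b10011  # assumed field GF(2^4), defined by x^4 + x + 1

def gf_mul(a, b):
    """Multiply two field elements stored as integers in a polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def frob(a, i):
    """Return a^[i] = a^(2^i) for i >= 0, using i squarings."""
    for _ in range(i):
        a = gf_mul(a, a)
    return a

def lin_eval(f, x):
    """Evaluate f(x) = sum_i f[i] * x^[i]."""
    acc = 0
    for i, fi in enumerate(f):
        acc ^= gf_mul(fi, frob(x, i))
    return acc

def symbolic_product(a, b):
    """Coefficients of a(x) (x) b(x) = a(b(x)): c_k = sum_{i+j=k} a_i * b_j^[i]."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] ^= gf_mul(ai, frob(bj, i))
    return c
```

Note that the symbolic product is not commutative in general, so the order of the operands in expressions such as the key equation matters.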

Gabidulin codes are often viewed as the counterpart in rank
metric codes of the well-known Reed-Solomon codes. As
shown in Table I, an analogy between Reed-Solomon and
Gabidulin codes can be established in many aspects. Such an
analogy helps us understand the decoding of Gabidulin codes,
and in some cases allows us to adapt innovations proposed for
Reed-Solomon codes to Gabidulin codes.

TABLE I
Analogy between Reed-Solomon and Gabidulin Codes

                       Reed-Solomon                 Gabidulin
Metric                 Hamming                      Rank
Ring of                Polynomials                  Linearized Polynomials
Degree                 $i$                          $[i] = q^i$
Key Operation          Polynomial Multiplication    Symbolic Product
Generator Matrix       $[g_j^{i}]$                  $[g_j^{[i]}]$
Parity Check Matrix    $[h_j^{i}]$                  $[h_j^{[i]}]$
Key Equation Solver    BMA                          Modified BMA
Error Locations        Roots                        Root Space Basis
Error Value Solver     Forney's Formula             Gabidulin's Algorithm




C. KK Codes and Their Decoding 

By the lifting operation [9], KK codes can be constructed
from Gabidulin codes. Lifting can also be seen as a
generalization of the standard approach to random linear network
coding [7], which transmits matrices in the form
$\mathbf{X} = [\mathbf{I} \mid \mathbf{x}]$, where $\mathbf{X} \in \mathbb{F}_q^{n \times M}$,
$\mathbf{x} \in \mathbb{F}_q^{n \times m}$, and $m = M - n$. Hence by
adding the constraint that $\mathbf{x}$ is the row expansion of codewords
from a Gabidulin code $\mathcal{C}$ over $\mathbb{F}_{q^m}$, error control is enabled
in network coding.
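
The lifting operation can be sketched in a few lines. The Python fragment below assumes $q = 2$ and stores each row as an integer bitmask; both choices, and the function names, are assumptions for illustration. It shows that the identity part forces every lifted matrix to have rank $n$, which is why the resulting subspace code is constant-dimension.

```python
def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    basis = []
    for r in rows:
        for b in basis:
            r = min(r, r ^ b)  # clears the leading bit of b from r if it is set
        if r:
            basis.append(r)
    return len(basis)

def lift(x_rows, m):
    """Return the rows of [I | x], where x is n x m with rows given as m-bit integers."""
    n = len(x_rows)
    return [(1 << (m + n - 1 - i)) | x for i, x in enumerate(x_rows)]
```

For example, `lift([0b101, 0b011], 3)` yields the rows `10101` and `01011` of a 2 x 5 matrix.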

In practice, the packet length could be very long. To
accommodate long packets based on the KK codes, very large $m$ and
$n$ are needed, which results in prohibitively high complexity
due to the huge field size of $\mathbb{F}_{q^m}$. A low-complexity approach
in [9] suggested that instead of using a single long Gabidulin
code, a Cartesian product of many short Gabidulin codes with
the same distance can be used to construct constant-dimension
codes for long packets via the lifting operation.

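In the decoding algorithm of [9], the matrix $\mathbf{Y}$ is first turned
into an RRE form. Based on an RRE form of $\mathbf{Y}$, a three-tuple
$(\mathbf{r}, \hat{\mathbf{L}}, \hat{\mathbf{E}})$, referred to as a reduction of
$\mathbf{Y}$, is obtained. It was proved in [9] that
\[
d_S(\langle \mathbf{X} \rangle, \langle \mathbf{Y} \rangle) =
2 \operatorname{rank} \begin{bmatrix} \hat{\mathbf{L}} & \mathbf{r} - \mathbf{x} \\ \mathbf{0} & \hat{\mathbf{E}} \end{bmatrix} - \mu - \delta,
\]
where $\mu = n - \operatorname{rank} \hat{\mathbf{L}}$ and
$\delta = N - \operatorname{rank} \hat{\mathbf{L}}$. Now the decoding
problem to minimize the subspace distance becomes a problem
to minimize the rank distance.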
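
The two distances involved in this reduction can be computed directly for small examples. The sketch below assumes $q = 2$ and integer-bitmask rows (assumptions for illustration, as are the function names), and uses the standard identity $\dim(U \cap V) = \dim U + \dim V - \dim(U + V)$ to evaluate the subspace distance from ranks alone.

```python
def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    basis = []
    for r in rows:
        for b in basis:
            r = min(r, r ^ b)  # clears the leading bit of b from r if it is set
        if r:
            basis.append(r)
    return len(basis)

def rank_distance(X, Y):
    """d_R(X, Y) = rank(X - Y) for two matrices of the same shape."""
    return rank_gf2([x ^ y for x, y in zip(X, Y)])

def subspace_distance(X, Y):
    """d_S = dim<X> + dim<Y> - 2 dim(<X> n <Y>) = 2 dim(<X> + <Y>) - dim<X> - dim<Y>."""
    return 2 * rank_gf2(list(X) + list(Y)) - rank_gf2(X) - rank_gf2(Y)
```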

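The generalized rank decoding [9] finds an error word
$\hat{\mathbf{e}} = \arg\min_{\mathbf{e} \in \mathbf{r} - \mathcal{C}} \operatorname{rank} \begin{bmatrix} \hat{\mathbf{L}} & \mathbf{e} \\ \mathbf{0} & \hat{\mathbf{E}} \end{bmatrix}$.
The error word $\hat{\mathbf{e}}$ is expanded as a summation of products of
column and row vectors [9] such that
$\hat{\mathbf{e}} = \sum_{j=0}^{\tau-1} \mathbf{L}_j \mathbf{E}_j$. Each term
$\mathbf{L}_j \mathbf{E}_j$ is called either an erasure, if $\mathbf{L}_j$ is known,
or a deviation, if $\mathbf{E}_j$ is known, or an error, if neither
$\mathbf{L}_j$ nor $\mathbf{E}_j$ is known. In this general decoding problem,
$\mathbf{L}$ has $\mu$ columns from $\hat{\mathbf{L}}$ and $\mathbf{E}$ has $\delta$
rows from $\hat{\mathbf{E}}$. Given a Gabidulin code of minimum distance $d$,
the corresponding KK code is able to correct $\epsilon$ errors, $\mu$
erasures, and $\delta$ deviations as long as $2\epsilon + \mu + \delta < d$.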

Algorithm 2 was proposed in [9] for generalized decoding of
the KK codes, and its data flow is shown in Figure 2. It
requires $O(dm)$ operations in $\mathbb{F}_{q^m}$ [9].

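Algorithm 2 (General Rank Decoding [9]).
Input: received tuple $(\mathbf{r}, \hat{\mathbf{L}}, \hat{\mathbf{E}})$
Output: error word $\hat{\mathbf{e}}$

2.1 Compute $\mathbf{S} = \mathbf{H}\mathbf{r}$, $\hat{\mathbf{X}} = \hat{\mathbf{L}}^T \mathbf{h}$,
$\lambda_U(x) = \operatorname{minpoly}(\hat{\mathbf{X}})$,
$\sigma_D(x) = \operatorname{minpoly}(\hat{\mathbf{E}})$, and
$S_{DU}(x) = \sigma_D(x) \otimes S(x) \otimes \zeta_U(x)$, where $\zeta_U(x)$ is the
$q$-reverse of $\lambda_U(x)$.

2.2 Compute the error span polynomial:

a) Use the modified BMA [16] to solve the key equation
$\sigma_F(x) \otimes S_{DU}(x) \equiv \omega(x) \bmod x^{[d-1]}$ such
that $\deg \omega(x) < [\tau]$, where $\tau = \epsilon + \mu + \delta$.

b) Compute $S_{FD}(x) = \sigma_F(x) \otimes \sigma_D(x) \otimes S(x)$.

c) Use Gabidulin's algorithm [10] to find $\beta$ that solves
$S_{FD,l} = \sum_{j=0}^{\mu-1} \hat{X}_j^{[l]} \beta_j$,
$l = d-2, d-3, \dots, d-1-\mu$.

d) Compute $\sigma_U(x) = \operatorname{minpoly}(\beta)$ followed by
$\sigma(x) = \sigma_U(x) \otimes \sigma_F(x) \otimes \sigma_D(x)$.

2.3 Find a basis $E$ for the root space of $\sigma(x)$.

2.4 Find the error locations:

a) Solve $S_l = \sum_{j=0}^{\tau-1} X_j^{[l]} E_j$, $l = 0, 1, \dots, d-2$, using
Gabidulin's algorithm [10] to find the error locators
$X_0, X_1, \dots, X_{\tau-1} \in \mathbb{F}_{q^m}$.

b) Compute the error locations $L_j$'s by solving (5).

c) Compute the error word
$\hat{\mathbf{e}} = \sum_{j=0}^{\tau-1} \mathbf{L}_j \mathbf{E}_j$, where
each $\mathbf{E}_j$ is the row expansion of $E_j$.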

III. Computational Complexity Reduction 

In general, RLNC is carried out over $\mathbb{F}_q$, where $q$ is any
prime power. That is, packets are treated as vectors over $\mathbb{F}_q$.
Since our investigation of computational complexities is for
both software and hardware implementations of RLNC, where
data are stored and transmitted in bits, we focus on RLNC over
characteristic-2 fields in our work, i.e., $q$ is a power of two.
In some cases, we further assume $q = 2$, as it leads to further
complexity reductions.

A. Finite Field Representation 

Finite field elements can be represented by vectors using
different types of bases: polynomial basis, normal basis, and
dual basis [34]. In rank metric decoders, most polynomials
involved are linearized polynomials, and hence their
evaluations and symbolic products require computing their $[i]$th
powers. Suppose a field element is represented by a vector
over $\mathbb{F}_q$ with respect to a normal basis. Computing $[i]$th powers
($i$ is a positive or negative integer) of the element then amounts
to cyclic shifts of the corresponding vector by $i$ positions, which
significantly reduces computational complexities. For example,
the computational complexity of Algorithm 1 is primarily due
to the following updates in Step 1.1:
\[
A_{i,j} = A_{i-1,j} - \left(\frac{A_{i-1,j}}{A_{i-1,i-1}}\right)^{[-1]} A_{i-1,i-1}, \qquad
Q_{i,j} = Q_{i-1,j} - \left(\frac{Q_{i-1,j+1}}{A_{i-1,i-1}}\right)^{[-1]} A_{i-1,i-1}, \tag{6}
\]
which require divisions and computing $[-1]$th powers. With
normal basis representation, $[-1]$th powers are obtained by
a single cyclic shift. Furthermore, when $q = 2$, they can be
computed in an inversionless form
\[
A_{i,j} = A_{i-1,j} - \left(A_{i-1,j} A_{i-1,i-1}\right)^{[-1]}, \qquad
Q_{i,j} = Q_{i-1,j} - \left(Q_{i-1,j+1} A_{i-1,i-1}\right)^{[-1]}, \tag{7}
\]
which also avoids finite field divisions or inversions. Thus
using normal basis representation also reduces the complexity
of Gabidulin's algorithm.
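
The cyclic-shift property can be checked on a toy field. The sketch below assumes $\mathbb{F}_{2^3}$ defined by $x^3 + x^2 + 1$, whose root $\alpha$ generates the normal basis $(\alpha, \alpha^2, \alpha^4)$; this field, the table-based conversion, and the function names are assumptions for illustration, not the fields or circuits used in our decoders.

```python
from itertools import product

M, POLY = 3, 0b1101  # assumed field GF(2^3), defined by x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two field elements stored as integers in a polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

# Normal basis (alpha, alpha^2, alpha^4), stored in the polynomial basis.
NB = [0b010]
for _ in range(M - 1):
    NB.append(gf_mul(NB[-1], NB[-1]))

def from_normal(coords):
    """Map normal-basis coordinates (c0, c1, c2) to the field element."""
    v = 0
    for c, b in zip(coords, NB):
        if c:
            v ^= b
    return v

# Lookup table from field element to its normal-basis coordinates.
TO_NORMAL = {from_normal(c): c for c in product((0, 1), repeat=M)}
assert len(TO_NORMAL) == 2 ** M  # the conjugates of alpha do form a basis

def frob_normal(coords, i=1):
    """[i]-th power in normal-basis coordinates: a cyclic shift by i positions."""
    i %= M
    return coords[-i:] + coords[:-i] if i else coords
```

Here `frob_normal(c, -1)` gives the $[-1]$th power (a square root when $q = 2$) with the same shift hardware, only in the opposite direction.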

In addition to reducing the complexities of finite field arithmetic
operations, normal basis representation leads to reduced
complexities in the decoding of Gabidulin and KK codes for
several reasons. First, it was shown that using a normal basis can
facilitate the computation of symbolic products [20]. Second,
it was also suggested [20] that solving (5) can be trivial using
a normal basis. When a normal basis is used as $h_i$'s and also
used for representation, the matrix $\mathbf{h}$, whose rows are vector
representations of $h_i$'s with respect to the basis $h_i$'s, becomes
an identity matrix with additional all-zero columns. Hence
solving (5) requires no computation. These two complexity
reductions were also observed in [21]. Third, if a normal basis
of $\mathbb{F}_{2^m}$ is used as $h_i$'s and $n = m$, the parity check matrix
$\mathbf{H}$ in (1) becomes a cyclic matrix. Thus syndrome computation
becomes part of a cyclic convolution of $(h_0, h_1, \dots, h_{m-1})$
and $\mathbf{r}$, for which fast algorithms are available [35], [36]. Using
fast cyclic convolution algorithms is favorable when $m$ is
large.

B. Normal Basis Arithmetic Operations 

We also propose finite field arithmetic operations with
reduced complexities, when normal basis representation is used.
When represented by vectors, the addition and subtraction of
two elements are simply component-wise addition, which is
straightforward to implement. For characteristic-2 fields
$\mathbb{F}_{2^m}$, inverses can be obtained efficiently by a sequence of
squaring and multiplying, since
$\beta^{-1} = \beta^{2^m - 2} = \beta^{2} \beta^{4} \cdots \beta^{2^{m-1}}$
for $\beta \in \mathbb{F}_{2^m}$ [34]. Since computing $[i]$th powers requires
no computation, the complexity of inversion in turn relies
on that of multiplication. Division can be implemented by a
concatenation of inversion and multiplication:
$\alpha / \beta = \alpha \cdot \beta^{-1}$,
and hence the complexity of division also depends on that of
multiplication in the end.
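
The inversion identity above translates directly into a short loop. The sketch below assumes $\mathbb{F}_{2^8}$ with the reduction polynomial $x^8 + x^4 + x^3 + x + 1$ and a polynomial-basis encoding, purely so that the example is easy to run; in a normal basis each squaring in the loop would be a cyclic shift rather than a multiplication. The function names are hypothetical.

```python
M, POLY = 8, 0x11B  # assumed field GF(2^8), defined by x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    """Multiply two field elements stored as integers in a polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_inv(beta):
    """beta^(-1) = beta^(2^M - 2) = beta^2 * beta^4 * ... * beta^(2^(M-1))."""
    result, power = 1, beta
    for _ in range(M - 1):
        power = gf_mul(power, power)    # next conjugate beta^(2^i)
        result = gf_mul(result, power)  # accumulate the product of conjugates
    return result

def gf_div(alpha, beta):
    """alpha / beta as a multiplication by the inverse of beta."""
    return gf_mul(alpha, gf_inv(beta))
```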

There are serial and parallel architectures for normal basis
finite field multipliers. To achieve high throughput in our
decoder, we consider only parallel architectures. Most normal
basis multipliers are based on the Massey-Omura (MO)
architecture [34], [37]. The complexity of a serial MO normal
basis multiplier, $C_N$, is defined as the number of terms $a_i b_j$ in
computing a bit of the product $c = ab$, where $a_i$'s and $b_j$'s are
the bits of $a$ and $b$, respectively. It has been shown [38] that
a parallel MO multiplier over $\mathbb{F}_{2^m}$ needs $m^2$ AND gates and
at most $m(C_N + m - 2)/2$ XOR gates. For instance, for the
fields $\mathbb{F}_{2^8}$ and $\mathbb{F}_{2^{16}}$, their $C_N$'s are minimized to 21
and 85, respectively [34]. Using a common subexpression elimination
algorithm [39], we significantly reduce the number of XOR
gates while maintaining the same critical path delays (CPDs)
of one AND plus five XOR gates and one AND plus seven
XOR gates as direct implementations, respectively. Our results
are compared to those in [34], [38] in Table II, where we also
provide the prime polynomial $P(x)$ for each field.

The reduced gate count for normal basis multiplication is 
particularly important for hardware implementations of RLNC. 
This improvement is transparent to the complexity of decoders, 
in terms of finite field operations. When decoders for RLNC 
are realized in hardware, the reduced gate count for normal 
basis multiplication will be reflected in reduced area and power 
consumption. 

C. Inversionless BMA 

The modified BMA for rank metric codes [16] is similar to
the BMA for RS codes except that polynomial multiplications
are replaced by symbolic products. The modified BMA [16]
requires finite field divisions, which are more complex than
other arithmetic operations. Following the idea of the
inversionless RS decoder [23], we propose an inversionless variant
in Algorithm 3.

Algorithm 3. iBMA
Input: Syndromes $\mathbf{S}$
Output: $\Lambda(x)$

3.1 Initialize: $\Lambda^{(0)}(x) = B^{(0)}(x) = x^{[0]}$, $\Gamma^{(0)} = 1$, and
$L = 0$.

3.2 For $r = 0, 1, \dots, 2t-1$,

a) Compute the discrepancy
$\Delta_r = \sum_{j=0}^{L} \Lambda_j^{(r)} S_{r-j}^{[j]}$.

b) If $\Delta_r = 0$, then go to (e).

c) Modify the connection polynomial:
$\Lambda^{(r+1)}(x) = (\Gamma^{(r)})^{[1]} \Lambda^{(r)}(x) - \Delta_r x^{[1]} \otimes B^{(r)}(x)$.

d) If $2L > r$, go to (e). Otherwise, $L = r + 1 - L$,
$\Gamma^{(r+1)} = \Delta_r$, and $B^{(r+1)}(x) = \Lambda^{(r)}(x)$. Go to (a).

e) Set $\Gamma^{(r+1)} = (\Gamma^{(r)})^{[1]}$ and
$B^{(r+1)}(x) = x^{[1]} \otimes B^{(r)}(x)$.

3.3 Set $\Lambda(x) = \Lambda^{(2t)}(x)$.

Fig. 2. Data flow of KK decoding

TABLE II
Complexities of bit-parallel normal basis multipliers over finite fields
(For these two fields, all three implementations have the same CPD.)

Using a similar approach as in [23], we prove that the
output $\Lambda(x)$ of Algorithm 3 is the same as $\sigma(x)$ produced
by the modified BMA, except it is scaled by a constant
$C = \prod_{i=0}^{t-1} (\Gamma^{(2i)})^{[1]}$. However, this scaling is
inconsequential since the two polynomials have the same root space.

Using normal basis, the modified BMA in [16] requires at
most $\lfloor (d-2)/2 \rfloor$ inversions, $(d-1)(d-2)$ multiplications,
and $(d-1)(d-2)$ additions over $\mathbb{F}_{q^m}$ [20]. Our inversionless
version, Algorithm 3, requires at most $(3/2)d(d-1)$
multiplications and $(d-1)(d-2)$ additions. Since a normal basis
inversion is obtained by $m - 1$ normal basis multiplications,
the complexity of normal basis inversion is roughly $m - 1$
times that of normal basis multiplication. Hence, Algorithm 3
reduces the complexity considerably. Algorithm 3 is also more
suitable for hardware implementation, as shown in Section IV.

D. Finding the Root Space 

Instead of finding roots of polynomials in RS decoding, we
need to find the root spaces of linearized polynomials in rank
metric decoding. Applying the Chien search used in RS decoding
would incur a high complexity for two reasons. First, it requires
polynomial evaluations over the whole field, whose complexity is
very high. Second, it cannot find a set of linearly independent
roots.

A probabilistic algorithm to find the root space was
proposed in [33]. For Gabidulin codes, it can be further
simplified as suggested in [20]. But hardware implementations
of probabilistic algorithms require random number generators.
Furthermore, the algorithm in [33] requires symbolic long
division, which is also not suitable for hardware implementations.
According to [9], the average complexity of the probabilistic
algorithm in [33] is $O(dm)$ operations over $\mathbb{F}_{q^m}$, while that of
Berlekamp's deterministic method [32] is $O(dm)$ operations
in $\mathbb{F}_{q^m}$ plus $O(m^3)$ operations in $\mathbb{F}_q$. Thus their complexity
difference is small, and hence we focus on the deterministic
method since it is much easier to implement.

Suppose we need to find the root space of a linearized
polynomial $r(x)$. Berlekamp's deterministic method first evaluates
the polynomial $r(x)$ on a basis of the field
$(\alpha_0, \alpha_1, \dots, \alpha_{m-1})$ such that $v_i = r(\alpha_i)$,
$i = 0, 1, \dots, m-1$. Then it expands the $v_i$'s
in the base field as columns of an $m \times m$ matrix $\mathbf{V}$ and finds
linearly independent roots $\mathbf{z}$ such that $\mathbf{V}\mathbf{z} = \mathbf{0}$. Using the
representation based on $(\alpha_0, \alpha_1, \dots, \alpha_{m-1})$, the roots
$\mathbf{z}$ are also the roots of the given polynomial. Finding $\mathbf{z}$ amounts
to obtaining the linearly dependent combinations of the columns of
$\mathbf{V}$, which can be done by Gaussian elimination.
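
The deterministic method just described can be sketched as follows. The code assumes $q = 2$, the field $\mathbb{F}_{2^4}$ defined by $x^4 + x + 1$, and the polynomial basis $(1, \alpha, \alpha^2, \alpha^3)$ as the evaluation basis, so that the bits of an integer are its base-field coordinates; these choices and the function names are assumptions for illustration. The elimination tracks, for each evaluated value, which combination of basis elements produced it, so every combination that reduces to zero is a root.

```python
M, POLY = 4, 0b10011  # assumed field GF(2^4), defined by x^4 + x + 1

def gf_mul(a, b):
    """Multiply two field elements stored as integers in a polynomial basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def lin_eval(f, x):
    """Evaluate the linearized polynomial f(x) = sum_i f[i] * x^(2^i)."""
    acc = 0
    for fi in f:
        acc ^= gf_mul(fi, x)
        x = gf_mul(x, x)  # advance x to the next Frobenius power
    return acc

def root_space_basis(r):
    """Return linearly independent roots spanning the root space of r(x)."""
    pivots = []  # (value, combination) pairs with distinct leading bits
    roots = []
    for i in range(M):
        combo = 1 << i                 # basis element alpha^i
        val = lin_eval(r, combo)       # v_i = r(alpha_i), a column of V
        for pv, pc in pivots:
            if val ^ pv < val:         # leading bit of pv is set in val
                val, combo = val ^ pv, combo ^ pc
        if val:
            pivots.append((val, combo))
        else:
            roots.append(combo)        # r(combo) = 0 by linearity
    return roots
```

For instance, `root_space_basis([1, 1])` returns `[1]`, since the roots of $x^{[1]} + x^{[0]} = x^2 + x$ are exactly the elements of $\mathbb{F}_2$.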

E. n-RRE Form 

Given a received subspace spanned by a set of received
packets, the input of Algorithm 2 is a three-tuple, called a
reduction of the received space represented by its generator
matrix $\mathbf{Y}$; the three-tuple is obtained based on $\mathbf{Y}$ when it is
in its RRE form [9]. Thus, before the decoding starts,
preprocessing is performed on the received packets so as to obtain
the RRE form of $\mathbf{Y}$. We show that $\mathbf{Y}$ needs to satisfy only a
relaxed constraint. The relaxed constraint on $\mathbf{Y}$ does not affect
the decoding outcome, while leading to two advantages. First,
the relaxed constraint results in reduced complexities in the
preprocessing step. Second and more importantly, the relaxed
constraint enables parallel processing of decoding KK codes
based on Cartesian products.

We first define an $n$-RRE form for received matrices. Given
a matrix $\mathbf{Y} = [\mathbf{A} \mid \mathbf{y}]$, where
$\mathbf{A} \in \mathbb{F}_q^{N \times n}$ and $\mathbf{y} \in \mathbb{F}_q^{N \times m}$, the
matrix $\mathbf{Y}$ is in its $n$-RRE form as long as $\mathbf{A}$ (its leftmost $n$
columns) is in its RRE form. Compared with the RRE form,
the $n$-RRE form is more relaxed as it puts no constraints on
the right part. We note that an $n$-RRE form of a matrix is not
unique.
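
A minimal sketch of the preprocessing implied by this definition is given below, assuming $q = 2$ and a list-of-bits matrix format (both assumptions, as is the function name). Pivots are searched only in the leftmost $n$ columns, while row operations are applied to whole rows, so the right part is carried along without being reduced further.

```python
def n_rre(Y, n):
    """Bring Y = [A | y] over GF(2) into an n-RRE form (A in RRE form)."""
    rows = [r[:] for r in Y]
    pivot_row = 0
    for col in range(n):  # only the leftmost n columns provide pivots
        sel = next((i for i in range(pivot_row, len(rows)) if rows[i][col]), None)
        if sel is None:
            continue      # no pivot in this column
        rows[pivot_row], rows[sel] = rows[sel], rows[pivot_row]
        for i in range(len(rows)):
            if i != pivot_row and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[pivot_row])]
        pivot_row += 1
    return rows
```

With `Y = [[1,1,1,0],[1,1,0,1],[0,1,1,1]]` and `n = 2`, the result is `[[1,0,0,1],[0,1,1,1],[0,0,1,1]]`: the left part is in RRE form, whereas a full RRE form of `Y` would additionally clear the third entry of the second row.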

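We now show that the relaxed constraint does not affect the
decoding. Similar to [40, Proposition 7], we first show that a
reduction based on an $n$-RRE form of $\mathbf{Y}$ always exists. Given
$\mathbf{Y} = [\mathbf{A} \mid \mathbf{y}]$ and $\operatorname{RRE}(\mathbf{A}) = \mathbf{R}\mathbf{A}$, where the
matrix $\mathbf{R}$ represents the reducing operations, the product
$\mathbf{Y}' = \mathbf{R}\mathbf{Y} = [\mathbf{B}' \mid \mathbf{Z}']$ is in its $n$-RRE form. We note
that $\mathbf{B}' \in \mathbb{F}_q^{N \times n}$ and $\mathbf{Z}' \in \mathbb{F}_q^{N \times m}$, where the
column and row rank deficiencies of $\mathbf{B}'$ are given by
$\mu' = n - \operatorname{rank} \mathbf{B}'$ and $\delta' = N - \operatorname{rank} \mathbf{B}'$,
respectively. We have the following result about the reduction
based on $\mathbf{Y}'$.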

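Lemma 1. Let $Y'$, $\mu'$, and $\delta'$ be defined as above. There exists a tuple $(r', L', E') \in \mathbb{F}_q^{n \times m} \times \mathbb{F}_q^{n \times \mu'} \times \mathbb{F}_q^{\delta' \times m}$ and a set $\mathcal{U}'$ satisfying

$$|\mathcal{U}'| = \mu', \qquad I_{\mathcal{U}'}^T r' = 0, \qquad I_{\mathcal{U}'}^T L' = -I_{\mu' \times \mu'}, \qquad \operatorname{rank} E' = \delta'$$

so that

$$\left\langle \begin{bmatrix} I + L' I_{\mathcal{U}'}^T & r' \\ 0 & E' \end{bmatrix} \right\rangle = \langle Y' \rangle. \tag{8}$$

Proof: See Appendix A. ■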
Lemma 1 shows that we can find an alternative reduction based on an $n$-RRE form of $Y$, instead of an RRE form of $Y$. The key of our alternative reduction of $Y$ is that the reduction is mostly determined by the first $n$ columns of $\mathrm{RRE}(Y)$. Also, this alternative reduction does not come as a surprise. As shown in [40, Proposition 8], row operations on $E$ can produce alternative reductions. Next, we show that decoding based on our alternative reduction is the same as in [40]. Similar to [40, Theorem 9], we have the following results.

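Lemma 2. Let $(r', L', E')$ be a reduction of $Y$ determined by its $n$-RRE form. Then we have

$$d_S(\langle X \rangle, \langle Y \rangle) = 2 \operatorname{rank} \begin{bmatrix} L' & r' - x \\ 0 & E' \end{bmatrix} - \mu' - \delta'.$$

Proof: See Appendix B. ■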
Lemma 2 shows that the subspace decoding problem is equivalent to the generalized Gabidulin decoding problem with the alternative reduction $(r', L', E')$, which is obtained from an $n$-RRE form of $Y$. We illustrate an example of KK decoding based on the $n$-RRE approach in Appendix C.
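Lemma 2 rewrites the subspace distance, which by definition is $d_S(\langle X \rangle, \langle Y \rangle) = 2 \operatorname{rank} \begin{bmatrix} X \\ Y \end{bmatrix} - \operatorname{rank} X - \operatorname{rank} Y$. The following Python sketch evaluates that definition directly for small generator matrices over $\mathbb{F}_2$; the function names and the example matrices are illustrative assumptions, not part of the decoder.

```python
def rank_gf2(rows):
    """Rank over GF(2) of a matrix given as a list of 0/1 row lists."""
    basis = {}  # leading-bit position -> reduced row stored as an integer
    for row in rows:
        v = 0
        for bit in row:
            v = (v << 1) | bit
        while v:
            hb = v.bit_length() - 1
            if hb not in basis:
                basis[hb] = v
                break
            v ^= basis[hb]
    return len(basis)

def subspace_distance(X, Y):
    """d_S(<X>, <Y>) = 2 rank([X; Y]) - rank(X) - rank(Y)."""
    return 2 * rank_gf2(X + Y) - rank_gf2(X) - rank_gf2(Y)

# Hypothetical example: two 2-dimensional subspaces of F_2^4 sharing one vector.
X = [[1, 0, 0, 1], [0, 1, 1, 0]]
Y = [[1, 0, 0, 1], [0, 0, 1, 1]]
distance = subspace_distance(X, Y)
```

Here the two row spaces intersect in a one-dimensional space, so the stacked matrix has rank 3 and the distance is 2.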

Our alternative reduction leads to two advantages. First, it results in reduced complexity in preprocessing. Given a matrix $Y$, the preprocessing needed to transform $Y$ into its $n$-RRE form is only part of the preprocessing to transform $Y$ into its RRE form. We can show that the maximal number of arithmetic operations in the former preprocessing is given by $(N-1) \sum_{i=0}^{\operatorname{rank} A - 1} (n + m - i)$, whereas that of the latter preprocessing is $(N-1) \sum_{i=0}^{\operatorname{rank} Y - 1} (n + m - i)$. Since $\operatorname{rank} Y \geq \operatorname{rank} A$, the relaxed constraint leads to a lower complexity, and the reduction depends on $\operatorname{rank} Y$ and $\operatorname{rank} A$.

Second, the reduction for $n$-RRE forms is completely determined by the $n$ leftmost columns of $Y$ instead of the whole matrix, which greatly simplifies hardware implementations. This advantage is particularly important for the decoding of constant-dimension codes that are lifted from Cartesian products of Gabidulin codes. Since the row operations to obtain an $n$-RRE form depend on $A$ only, decoding $[A \mid y_0 \mid y_1 \mid \cdots \mid y_{l-1}]$ can be divided into parallel and smaller decoding problems whose inputs are $[A \mid y_0], [A \mid y_1], \ldots, [A \mid y_{l-1}]$. Thus, for these constant-dimension codes, we can decode in a serial manner with only one small decoder, in a partly parallel fashion with more decoders, or even in a fully parallel fashion. This flexibility allows tradeoffs between cost/area/power and throughput. Furthermore, since the erasures $L$ are determined by $A$ and are the same for all $[A \mid y_i]$, the computation of $X$ and $\lambda_U(x)$ in Algorithm 2 can be shared among these parallel decoding problems, thereby reducing overall complexity.
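The observation that the row operations depend on $A$ only can be checked with a small Python sketch over $\mathbb{F}_2$. The routine below is an ordinary software row reduction restricted to the leftmost $n$ columns, not the pivoting hardware scheme of Section IV-C, and the matrices and helper names are made up for illustration.

```python
def n_rre(Y, n):
    """Row-reduce Y over GF(2) so that its leftmost n columns are in RRE form."""
    Y = [row[:] for row in Y]
    r = 0  # index of the next pivot row
    for c in range(n):  # pivots are searched in the first n columns only
        piv = next((i for i in range(r, len(Y)) if Y[i][c]), None)
        if piv is None:
            continue
        Y[r], Y[piv] = Y[piv], Y[r]
        for i in range(len(Y)):
            if i != r and Y[i][c]:
                Y[i] = [a ^ b for a, b in zip(Y[i], Y[r])]
        r += 1
    return Y

def hstack(*blocks):
    """Concatenate matrices with equal row counts side by side."""
    return [sum(parts, []) for parts in zip(*blocks)]

# Hypothetical 3 x 3 left part A (rank 2) and two right blocks y0, y1.
A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
y0 = [[1, 0], [0, 1], [1, 1]]
y1 = [[0, 1], [1, 1], [0, 0]]

joint = n_rre(hstack(A, y0, y1), 3)   # reduce [A | y0 | y1] at once
part0 = n_rre(hstack(A, y0), 3)       # reduce [A | y0] alone
part1 = n_rre(hstack(A, y1), 3)       # reduce [A | y1] alone
```

Because every pivot decision is made on the columns of $A$, the columns of `joint` that belong to `y0` and `y1` coincide with those of `part0` and `part1`, which is what allows the smaller problems to be decoded separately or in parallel.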

F. Finding Minimal Linearized Polynomials 

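Minimal linearized polynomials can be computed by solving systems of linear equations. Given roots $\beta_0, \beta_1, \ldots, \beta_{p-1}$, the minimal linearized polynomial $x^{[p]} + \sum_{i=0}^{p-1} a_i x^{[i]}$ satisfies

$$\begin{bmatrix} \beta_0^{[0]} & \beta_0^{[1]} & \cdots & \beta_0^{[p-1]} \\ \beta_1^{[0]} & \beta_1^{[1]} & \cdots & \beta_1^{[p-1]} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{p-1}^{[0]} & \beta_{p-1}^{[1]} & \cdots & \beta_{p-1}^{[p-1]} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{p-1} \end{bmatrix} = - \begin{bmatrix} \beta_0^{[p]} \\ \beta_1^{[p]} \\ \vdots \\ \beta_{p-1}^{[p]} \end{bmatrix}. \tag{9}$$

Thus it can be solved by Gaussian elimination over the extension field $\mathbb{F}_{q^m}$. Gabidulin's algorithm is not applicable because the rows of the matrix are not the powers of the same element.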

The complexity to solve (9) is very high. Instead, we reformulate the method from [27, Theorem 7]. The main idea of [27, Theorem 7] is to recursively construct the minimal linearized polynomial using symbolic products instead of polynomial multiplications in polynomial interpolation. Given linearly independent roots $w_0, w_1, \ldots, w_{p-1}$, we can construct a series of linearized polynomials as

$$\begin{aligned} F_0(x) &= x^{[1]} - w_0^{q-1} x^{[0]}, \\ F_i(x) &= \left( x^{[1]} - (F_{i-1}(w_i))^{q-1} x^{[0]} \right) \otimes F_{i-1}(x). \end{aligned}$$

In the method in [27, Theorem 7], the evaluation of $F_i(w_j)$ has an increasing complexity when the degree of $F_i(x)$ gets higher. To facilitate the implementation, we reformulate the algorithm to divide the evaluation into multiple steps. The idea is based on the fact that $F_i(w_{i+1}) = \left( x^{[1]} - (F_{i-1}(w_i))^{q-1} x^{[0]} \right) \otimes F_{i-1}(w_{i+1})$. The reformulated algorithm is given in Algorithm 4. Thus the evaluation can also be done in a recursive way.

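Algorithm 4 (Algorithm for Minimal Linearized Polynomials).

Input: Roots $w_0, w_1, \ldots, w_{p-1}$

Output: The minimal linearized polynomial $F^{(p)}(x)$

4.1 Initialize $\gamma_j^{(0)} = w_j$ for $j = 0, 1, \ldots, p-1$ and $F^{(0)}(x) = x^{[0]}$.

4.2 For $i = 0, 1, \ldots, p-1$, if $\gamma_i^{(i)} \neq 0$,

a) $F^{(i+1)}(x) = (F^{(i)}(x))^{[1]} + (\gamma_i^{(i)})^{q-1} F^{(i)}(x)$;

b) For $j = i+1, i+2, \ldots, p-1$, $\gamma_j^{(i+1)} = (\gamma_j^{(i)})^{[1]} + (\gamma_i^{(i)})^{q-1} \gamma_j^{(i)}$.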

Since powers of $q$ require only cyclic shifting, the operations in Algorithm 4 are simple. Another advantage of Algorithm 4 is that it does not require the roots to be linearly independent. In Algorithm 4, $F^{(i)}(w_j) = 0$ for $j = 0, 1, \ldots, i-1$ and $\gamma_i^{(i)} = F^{(i)}(w_i)$. If $w_0, w_1, \ldots, w_i$ are linearly dependent, $\gamma_i^{(i)} = 0$ and hence $w_i$ is ignored. So Algorithm 4 integrates detection of linear dependency at no extra computational cost.

Essentially, Algorithm 4 breaks down evaluations of high $q$-degree polynomials into evaluations of polynomials with $q$-degree one. It gets rid of complex operations while maintaining the same total complexity of the algorithm.
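The recursion can be sketched in Python for the special case $q = 2$, where $\gamma^{q-1} = \gamma$ and subtraction coincides with addition. The toy field $\mathbb{F}_{2^4}$ with polynomial $x^4 + x + 1$, the polynomial-basis integer encoding, and the function names are assumptions made only for this illustration.

```python
def gf_mul(a, b, m=4, poly=0b10011):
    """Multiply in GF(2^m); the field polynomial is an assumed toy choice."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return r

def minimal_linearized_poly(roots, m=4, poly=0b10011):
    """Return coefficients F[k] of x^(2^k), lowest first, for q = 2."""
    gamma = list(roots)  # gamma[j] holds F(w_j) for the current F
    F = [1]              # start from F(x) = x
    for i in range(len(roots)):
        g = gamma[i]
        if g == 0:
            continue     # w_i depends on earlier roots and is skipped
        # F <- F(x)^2 + g * F(x); squaring shifts every coefficient up one q-degree.
        squared = [0] + [gf_mul(c, c, m, poly) for c in F]
        scaled = [gf_mul(g, c, m, poly) for c in F] + [0]
        F = [a ^ b for a, b in zip(squared, scaled)]
        # Update the stored evaluations with the same degree-one step.
        for j in range(i + 1, len(roots)):
            gj = gamma[j]
            gamma[j] = gf_mul(gj, gj, m, poly) ^ gf_mul(g, gj, m, poly)
    return F
```

In this encoding the elements 1 and 6 span the subfield $\mathbb{F}_4$, so `minimal_linearized_poly([1, 6])` yields the coefficients of $x^{[2]} + x^{[0]}$, and adding the dependent root 7 (which equals 1 + 6) leaves the result unchanged.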



IV. Architecture Design 

Aiming to reduce the storage requirement and total area as well as to improve the regularity of our decoder architectures, we further reformulate the steps in the decoding algorithms of both Gabidulin and KK codes. Again, we assume the decoder architectures are suitable for RLNC over $\mathbb{F}_q$, where $q$ is a power of two.

A. High-Speed BMA Architecture 

To increase the throughput, regular BMA architectures with shorter CPD are necessary. Following the approaches in [24], we develop two architectures based on Algorithm 3, which are analogous to the riBM and RiBM algorithms in [24].

In Algorithm 3, the critical path is in step 3.2(a). Note that $\Delta_r$ is the $r$th coefficient of the discrepancy polynomial $\Delta^{(r)}(x) = \Lambda^{(r)}(x) \otimes S(x)$. By using $\Theta^{(r)}(x) = B^{(r)}(x) \otimes S(x)$, $\Delta^{(r+1)}(x)$ can be computed as

$$\begin{aligned} \Delta^{(r+1)}(x) &= \Lambda^{(r+1)}(x) \otimes S(x) \\ &= \left[ (\Gamma^{(r)})^{[1]} \Lambda^{(r)}(x) - \Delta_r x^{[1]} \otimes B^{(r)}(x) \right] \otimes S(x) \\ &= (\Gamma^{(r)})^{[1]} \Delta^{(r)}(x) - \Delta_r x^{[1]} \otimes \Theta^{(r)}(x), \end{aligned} \tag{10}$$

which has the same structure as step 3.2(c). Hence this reformulation is more conducive to a regular implementation.

Given the similarities between step 3.2(a) and (10), $\Lambda(x)$ and $\Delta(x)$ can be combined together into one polynomial $\tilde{\Delta}(x)$. Similarly, $B(x)$ and $\Theta(x)$ can be combined into one polynomial $\tilde{\Theta}(x)$. These changes are incorporated in our RiBMA algorithm, shown in Algorithm 5.

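Algorithm 5 (RiBMA).
Input: Syndromes $S$
Output: $\Lambda(x)$

5.1 Initialize: $\tilde{\Delta}^{(0)}(x) = \tilde{\Theta}^{(0)}(x) = \sum_{i=0}^{2t-1} S_i x^{[i]}$, $\Gamma^{(0)} = 1$, $\tilde{\Delta}^{(0)}_{3t} = \tilde{\Theta}^{(0)}_{3t} = 1$, and $b = 0$.

5.2 For $r = 0, 1, \ldots, 2t-1$,

a) Modify the combined polynomial: $\tilde{\Delta}^{(r+1)}(x) = \Gamma^{(r)} \tilde{\Delta}^{(r)}(x) - \tilde{\Delta}_0^{(r)} \tilde{\Theta}^{(r)}(x)$;

b) Set $b = b + 1$;

c) If $\tilde{\Delta}_0^{(r)} \neq 0$ and $b > 0$, set $b = -b$, $\Gamma^{(r+1)} = \tilde{\Delta}_0^{(r)}$, and $\tilde{\Theta}^{(r)}(x) = \tilde{\Delta}^{(r)}(x)$;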

d) Set A('-+i)(a:) = E'io'^i+t'^^'''' = 

e) Set $\Gamma^{(r+1)} = (\Gamma^{(r)})^{[1]}$ and $\tilde{\Theta}^{(r+1)}(x) = x^{[1]} \otimes \tilde{\Theta}^{(r)}(x)$.

1213 SetA(x)=Elo4>^'^'- 

Following Algorithm 5, we propose a systolic RiBMA architecture shown in Fig. 3, which consists of $3t + 1$ identical processing elements (BEs), whose circuitry is shown in Fig. 4. The central control unit BCtrl processes $b$ inside, generates the global control signals $ct^{(r)}$ and $\Gamma^{(r)}$, and passes along the coefficient $\tilde{\Delta}_0^{(r)}$. The control signal $ct^{(r)}$ is set to 1 only if
$\tilde{\Delta}_0^{(r)} \neq 0$ and $b > 0$. In each processing element, there are two critical paths, both of which consist of one multiplier and one adder over $\mathbb{F}_{q^m}$. One starts from $\tilde{\Delta}_{i+1}^{(r)}$ and the other from $\tilde{\Theta}_i^{(r)}$, while both end in $\tilde{\Delta}_i^{(r+1)}$.

Fig. 4. The processing element BE$_i$ ($x^{[1]}$ is a cyclic shift, and requires no hardware but wiring)



B. Generalized BMA 

The key equation of KK decoding is essentially the same as the key equation given earlier, but $\omega(x)$ has $q$-degree less than $\tau$, instead of $\lfloor (d-1)/2 \rfloor$. Actually, in KK decoding, we do not know the exact value of $\tau$ before solving the key equation. But all we need is to determine the maximum number of correctable errors $t'$ given $\mu$ erasures and $\delta$ deviations, which is given by $t' = \lfloor (d - 1 - \mu - \delta)/2 \rfloor$. Hence we need to generalize our previous BMA architectures in Section III-C for KK decoding, as in Algorithm 6. To apply Algorithm 6 to Gabidulin decoding, we can simply use $\theta = \mu + \delta = 0$.

Fig. 3. The RiBMA architecture

Algorithm 6 (Generalized RiBMA).
Input: $S$ and $\theta$
Output: $\Lambda(x)$

mi Initialize as follows: t' = [{d - 1 - e)/2], A^^^x) = 

r(0) = 1, and 6 = 0. 
|6l2 Forr = 0,l,...,2t'-1, 

a) Modify the combined polynomial: A^^'^^\x) = 
rWAW(2;) - A[,''^eM(a;); 

b) Set 6 = 6 + 1; 

c) If A[,'^^ 7^ and 6 > 0, set b = -b, T^'^+i) ^ aI[\ 
and eW(a;) = AW(2;); 

d) Set A('-+i)(a;) = E'I'o^*"' 4>t'^a;W, eW(a;) = 

e) Set r(''+i) = (r('-))[il and O'-^'+'^^x) = ® 
eW(a;). 

1113 SetA(:r) = EtoAl+?^W. 

Compared with Algorithm 5, we replace $t$ by $t'$. The variable $t'$ makes it difficult to design regular architectures. By carefully initializing $\tilde{\Delta}^{(0)}(x)$ and $\tilde{\Theta}^{(0)}(x)$, we ensure that the desired output $\Lambda(x)$ is always at a fixed position of $\tilde{\Delta}^{(2t')}(x)$, regardless of $\mu + \delta$. Hence, the only irregular part is the initialization. The initialization of Algorithm 6 can be done by shifting in at most $\theta$ cycles. Hence we reuse the RiBMA architecture in Fig. 3 in the KK decoder and keep the same worst-case latency of $2t$ cycles.

C. Gaussian Elimination 

We need Gaussian elimination to obtain $n$-RRE forms as well as to find root spaces. Furthermore, Gabidulin's algorithm in Algorithm 1 is essentially a smart way of Gaussian elimination, which takes advantage of the properties of the matrix. The reduction (to obtain $n$-RRE forms) and finding the root space are Gaussian eliminations on matrices over $\mathbb{F}_q$, while Gabidulin's algorithm operates on matrices over $\mathbb{F}_{q^m}$. In this section, we focus on Gaussian eliminations over $\mathbb{F}_q$, and Gabidulin's algorithm will be discussed in Section IV-D.

For high-throughput implementations, we adopt the pivoting architecture in [41], which was developed for non-singular matrices over $\mathbb{F}_2$. It always keeps the pivot element on the top-left location of the matrix, by cyclically shifting the rows and columns. Our Gaussian elimination algorithm, shown in Algorithm 7, has three key differences from the pivoting architecture in [41]. First, Algorithm 7 is applicable to matrices over any field. Second and more importantly, Algorithm 7 detects singularity and can be used for singular matrices. This feature is necessary since singular matrices occur in the reduction for the RRE form and in finding the root space. Third, Algorithm 7 is also flexible about matrix sizes, which are determined by the variable numbers of errors, erasures, and deviations.

Algorithm 7 (Gaussian Elimination for Root Space).

Input: An $m \times m$ matrix $M \in \mathbb{F}_q^{m \times m}$, whose rows are evaluations of $\sigma(x)$ over the normal basis, and $B = I$

Output: Linearly independent roots of $\sigma(x)$

7.1 Set $i = 0$.

7.2 For $j = 0, 1, \ldots, m-1$

a) While $M_{0,0} = 0$ and $j < m - i$, shiftup($M$, $i$) and shiftup($B$, $i$).

b) If $M_{0,0}$ is not zero, eliminate($M$), reduce($B$, $M$), and $i = i + 1$. Otherwise, shiftleft($M$).

7.3 The first $m - i$ rows of $M$ are all zeros and the first $m - i$ rows of $B$ are roots.

The eliminate and shiftup operations are quite similar to those in [41, Algorithm 2]. In eliminate($M$), for all $0 \le i < m$ and $0 \le j < m$, $M_{i,j} = M_{0,0} M_{i+1,(j+1) \bmod m} - M_{i+1,0} M_{0,(j+1) \bmod m}$ if $i < m-1$, or $M_{i,j} = M_{0,(j+1) \bmod m}$ otherwise.

The procedure reduce($B$, $M$) essentially repeats eliminate without column operations. For all $0 \le i < m$ and $0 \le j < m$, $B_{i,j} = M_{0,0} B_{i+1,j} - M_{i+1,0} B_{0,j}$ if $i < m-1$; otherwise $B_{i,j} = B_{0,j}$. In the shiftup($M$, $\rho$) operation, the first row is moved to the $(m-1-\rho)$th row while the second to the $(m-1-\rho)$th rows are moved up. That is, for $0 \le j < m$, $M_{i,j} = M_{0,j}$ if $i = m-1-\rho$, and $M_{i,j} = M_{i+1,j}$ for $0 \le i \le m-2-\rho$. In the shiftleft operation, all columns are cyclically shifted to the left, with the first column moved to the last. In other words, for all $0 \le i < m$ and $0 \le j < m$, $M_{i,j} = M_{i,(j+1) \bmod m}$. By adding a shiftleft operation, Algorithm 7 handles both singular and non-singular matrices, while [41, Algorithm 2] works only on non-singular matrices. Since $B$ is always full rank, the roots obtained are guaranteed to be linearly independent.
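The four operations can be modelled in software to see how the pivot stays in the top-left corner. The Python sketch below works over $\mathbb{F}_2$ only and is sequential, so reduce is applied before eliminate overwrites $M$; the bounded pivot search with its own counter is an assumption of this sketch rather than a transcription of step 7.2(a).

```python
def eliminate(M):
    """Pivot on M[0][0]: clear column 0, rotate rows up and columns left."""
    m, n, p = len(M), len(M[0]), M[0][0]
    out = [[(p & M[i + 1][(j + 1) % n]) ^ (M[i + 1][0] & M[0][(j + 1) % n])
            for j in range(n)] for i in range(m - 1)]
    out.append([M[0][(j + 1) % n] for j in range(n)])  # pivot row goes last
    return out

def reduce_rows(B, M):
    """Apply the row operations of eliminate(M) to B, without column shifts."""
    m, p = len(M), M[0][0]
    out = [[(p & B[i + 1][j]) ^ (M[i + 1][0] & B[0][j])
            for j in range(len(B[0]))] for i in range(m - 1)]
    out.append(B[0][:])
    return out

def shiftup(M, rho):
    """Rotate the top len(M) - rho rows up by one; the bottom rho rows stay."""
    k = len(M) - rho
    return M[1:k] + [M[0]] + M[k:]

def shiftleft(M):
    """Cyclically shift all columns left; the first column becomes the last."""
    return [row[1:] + row[:1] for row in M]

def pivoting_elimination(M0):
    """Return (M, B, i) with B * M0 = M over GF(2) and i pivots found."""
    m = len(M0)
    M = [row[:] for row in M0]
    B = [[int(r == c) for c in range(m)] for r in range(m)]
    i = 0
    for _ in range(m):
        tries = 0
        while M[0][0] == 0 and tries < m - i:  # bounded search (assumption)
            M, B = shiftup(M, i), shiftup(B, i)
            tries += 1
        if M[0][0]:
            B = reduce_rows(B, M)  # must read M before it is eliminated
            M = eliminate(M)
            i += 1
        else:
            M = shiftleft(M)
    return M, B, i

# Hypothetical singular example: the three rows sum to zero.
M0 = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
M_final, B_final, pivots = pivoting_elimination(M0)
```

In this sketch the pivot rows rotate to the bottom, so after all $m$ columns have been visited the all-zero rows of `M_final`, and the rows of `B_final` that record the vanishing combinations, occupy the first $m - i$ positions.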

We can get the root space using Algorithm 7, and we can also use it in KK decoding to reduce the received vector to an $n$-RRE form. However, Algorithm 7 provides only $E'$, so we need to extend it to obtain $L'$, as in Algorithm 8.

Algorithm 8 (Gaussian Elimination for $n$-RRE Forms).
Input: $N \times n$ matrix $A$ and $N \times m$ matrix $y$
Output: $L'$, $E'$, $r'$, and $\mu'$

8.1 Set $i = 0$, $\mathcal{U}'$ and $L'$ as empty.

8.2 For each column $j = 0, 1, \ldots, n-1$

a) While $A_{0,0} = 0$ and $j < n - i$, shiftup($A$, $i$), shiftup($y$, $i$), shiftup($L'$, $i$).

b) If $A_{0,0}$ is not zero, eliminate($A$), reduce($y$, $A$), shiftup($L'$, 0), $i = i + 1$.

c) Otherwise, shiftleft($A$), append the first column of $A$ to $L'$, set the top-right element of $L'$ to one, and add $j$ to $\mathcal{U}'$.

8.3 Set $\mu' = n - i$. The deviations $E'$ are given by the first $\mu'$ rows of $y$.

8.4 For each column $j \in \mathcal{U}'$, shiftup($L'$, $j$) and shiftup($y$, $j$).

8.5 The received vector $r'$ is given by $y$.

In Algorithm 8, we incorporate the extraction of $L'$, $E'$, and $r'$ into Gaussian elimination. So our architecture has the same worst-case latency as Algorithm 7 and requires no extra cycles to extract $L'$ out of the $n$-RRE form. Hence the throughput also remains the same.

Algorithm 7 is implemented by the regular architecture shown in Fig. 5, which is a two-dimensional array of $m \times 2m$ processing elements (GEs). The leftmost $m$ columns of processing elements correspond to $M$, and the rightmost $m$ columns to $B$. Algorithm 8 can be implemented with the same architecture with $N \times (n + m)$ GEs. The leftmost $n$ columns of processing elements correspond to $A$, and the rightmost $m$ columns to $y$. The elements for $L'$ are omitted in the figure. The circuitry of the processing element GE is shown in Fig. 6. For row $i$, the control signal $ct_i$ chooses from five inputs based on the operation: keeping the value, shiftleft, eliminate (or reduce), and shiftup (using the first row or the next row).



In Gabidulin's algorithm, there are two $\tau \times \tau$ matrices over $\mathbb{F}_{q^m}$, $A$ and $Q$. In Step 1.2, it requires only the $Q_{i,0}$'s to compute the coefficients. To compute $Q_{i,0}$ in (6), it requires only $Q_{i-1,0}$ and $Q_{i-1,1}$. And for $Q_{i,j}$ in (6), it requires only $Q_{i-1,j}$ and $Q_{i-1,j+1}$. Recursively, only those $Q_{i,j}$'s where $i + j < \tau$ are necessary. Actually, given any $i$, entries $Q_{i,0}, Q_{i+1,0}, \ldots, Q_{\tau-1,0}$ can be computed with the entries $Q_{i-1,0}, Q_{i-1,1}, \ldots, Q_{i-1,\tau-i}$. With $Q_{0,0}, Q_{1,0}, \ldots, Q_{i-2,0}$, we need to store only $\tau$ values to keep track of $Q$. Hence we reduce the storage of $Q$ from $\tau \times \tau$ $m$-bit registers down to $\tau$. We cannot reduce the storage of $A$ to $\tau(\tau + 1)/2$ because we have to use the pivoting scheme for short critical paths.



Fig. 8. The processing element AE$_{i,j}$

Fig. 6. The processing element GE$_{i,j}$

Fig. 9. The processing element QE$_i$



D. Gabidulin's Algorithm

In Gabidulin's algorithm, the matrix is first reduced to a triangular form. It takes advantage of the property of the matrix such that it requires no division in the first stage. In the first stage, we need to perform elimination on only one row. We use a pivoting scheme similar to that of Algorithm 7. When a row is reduced to have only one non-zero element, a division is used to obtain one coefficient of $X$. Then it performs a backward elimination after getting each coefficient. Hence we introduce a backward pivoting scheme, where the pivot element is always at the bottom-right corner.



Fig. 5. Regular architecture for Gaussian elimination

Fig. 7. Our architecture of Gabidulin's algorithm

In our decoder, Gabidulin's algorithm is implemented by the regular architecture shown in Fig. 7, which includes a triangular array of $\tau \times \tau$ AEs and a one-dimensional array of $\tau$ QEs. The circuitry of the processing elements AE$_{i,j}$ and QE$_i$ is shown in Figs. 8 and 9. The upper MUX in AE controls the output sending upward along the diagonal. Its control signal ctau$_i$ is 1 for the second row and 0 for other rows, since we update $A$ one row in a cycle and we keep the pivot on the upper left corner in Step 1.1. The control of the lower MUX in AE is 0 for working on Step 1.1, and 1 for working on Step 1.2. Similarly, the control of the MUX in QE is 0 for working on Step 1.1, and 1 for working on Step 1.2. But in Step 1.1, only part of the QEs need updating; the others should maintain their values, and their control signals ctq$_i$ are set to 2. Initially, $A_{0,i} = E_i$ and $q_i = S_i$ for $i = 0, 1, \ldots, \tau - 1$. Step 1.1 needs $\tau$ substeps. In the first $\tau - 1$ substeps, ctal$_{i+1}$ = 0, ctau$_i$ = 1, ctq$_0$ = ctq$_1$ = $\cdots$ = ctq$_i$ = 2, and ctq$_{i+1}$ = ctq$_{i+2}$ = $\cdots$ = ctq$_{\tau-1}$ = 0 for substep $i$. In the last substep, ctau$_i$ = 0 and all ctq$_i$'s are set to 2. This substep is to put the updated $A$ into the original position. In Step 1.2, the pivot is in the lower right corner, where we compute the $X_i$'s. Step 1.2 also needs $\tau$ substeps, in which all ctal$_i$'s and ctq$_i$'s are set to 1. First
$X_{\tau-1}$ is computed by $A_{\tau-1,\tau-1}^{-1} q_{\tau-1}$, where $q_{\tau-1} = Q_{\tau-1,0}$. Note that the inversion may need $m - 2$ clock cycles. In each substep, the matrix $A$ is moving down the diagonal so the $A_{i,i}$ to be inverted is always at the bottom right corner. At the same time, the $q_i$'s are also moving down. Basically, in substep $\rho$, the architecture updates $q_i$ to $Q_{i-\rho,0} - \sum_{j=\tau-1-\rho}^{\tau-1} A_{i-\rho,j} X_j$ for $i > \rho$ by doing one backward elimination at each substep.



E. Low Complexity Linearized Interpolation 

Note that in each iteration of step 4.2(b), the $q$-degree of $F(x)$ is no more than $i + 1$ and we need only $w_{i+1}, \ldots, w_{p-1}$. Hence the coefficients of $F(x)$ and the $w_i$'s can be packed into a register vector of length $p + 1$, where $F(x)$ and the $w_i$'s are updated concurrently, as in Algorithm 9. Along with the update, more registers are used for $F(x)$ and fewer for the $w_i$'s. In Algorithm 9, we keep $\gamma_p^{(i)} = 1$ during the interpolation.

Algorithm 9 (Reformulated Algorithm for Minimal Linearized Polynomials).

Input: Roots $w_0, w_1, \ldots, w_{p-1}$

Output: The minimal linearized polynomial $F(x)$

9.1 Initialize $\gamma_j^{(0)} = w_j$ for $j = 0, 1, \ldots, p-1$, $\gamma_p^{(0)} = 1$, and $a = 0$.
|9]2 Fori = 0,1,..., 

a) If 7^*^ ^ 0, 

i) For J = 0, 1, . . . ,p-l-a, 7]'+'^ = (7]^i)['l + 

(7^^¥-WSi; 

ii) For j=p-a,p-a + l,...,p-~l, 7]*^^^ = 

(7f)W + (7^^V-W]^i; 

iii) a = a + 1. 

b) Otherwise, for j = 0,1, . . . ,p — 1 — a, 7^*^^'' = 

7]+i' rp-J = 0, and 7} ^ = 7] ^ for j = p - 
a + 1, . . . , p — 1. 

c) 7p - 7p ■ 

|9]3 For i = p,p + 1, . . . , 2p — a — 1, 

a) For J - 0, 1, ... ,p - 1, 7]'+'^ = 7^1 and 7^ = 
0. 

04 F(x) = ELo7r"-'^W- 

The interpolation is actually done after Step 9.2. But if the roots are not linearly independent, $F(x)$ may not be aligned with $\gamma_0$. Therefore, in Step 9.3, we shift the registers until $\gamma_0$ is non-zero. In other words, it is shifted to the left $p - 1 - a$ times.

Algorithm 9 is implemented by the systolic architecture shown in Fig. 10, which consists of $p$ processing elements (MEs). We note that we need only $p$ processing elements since we assume that the highest coefficient is fixed to 1. The circuitry of the processing element ME$_j$ is shown in Fig. 11. The cr signal is 1 only when $\gamma_0 \neq 0$. The ct$_j$ signal for each cell is 1 only if $j < p - a$. Basically, ct$_j$ controls whether the update is for $F(x)$ or the $w_i$'s.



Fig. 10. Architecture of linearized polynomial interpolation



F. Decoding Failure 

A complete decoder declares decoding failure when no valid codeword is found within the decoding radius of the received word. To the best of our knowledge, decoding failures of Gabidulin and KK codes were not discussed in previous works. Similar to RS decoding algorithms, a rank decoder can return decoding failure when the roots of the error span polynomial $\Lambda(x)$ are not unique. That is, the root space of $\Lambda(x)$ has a dimension smaller than the $q$-degree of $\Lambda(x)$. Note that this applies to both Gabidulin and KK decoders. For KK decoders, another condition of decoding failure is when the total number of erasures and deviations exceeds the decoding bound $d - 1$.

Fig. 11. The processing element ME$_j$ ($x^{[1]}$ is a cyclic shift, and requires no hardware but wiring)

G. Latency and Throughput 

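We analyze the worst-case decoding latencies of our decoder architectures, in terms of clock cycles, in Table III.

TABLE III
Worst-case decoding latency (in terms of clock cycles)

| Step | Gabidulin | KK |
|---|---|---|
| $n$-RRE | - | $n(N+1)/2$ |
| Syndrome $S$ | $n$ | $n$ |
| $\lambda_U(x)$ | - | $4t$ |
| $\sigma_D(x)$ | - | $4t$ |
| $S_{DU}(x)$ | - | $2(d-1)$ |
| BMA | $2t$ | $2t$ |
| $S_{FD}(x)$ | - | $d-1$ |
| $\beta$ | - | $(m+2)(d-1)$ |
| $\sigma_U(x)$ | - | $4t$ |
| $\sigma(x)$ | - | $d-1$ |
| Root space basis $E$ | $m(m+1)/2$ | $m(m+1)/2$ |
| Error locator $L$ | $2t + mt$ | $(m+2)(d-1)$ |
| Error word $e$ | $t$ | $2t$ |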



As in [41], the latency of Gaussian elimination for the $n$-RRE form is at most $n(N+1)/2$ cycles. Similarly, the latency of finding the root space is at most $m(m+1)/2$. For Gabidulin's algorithm, it needs one cycle per row for forward elimination and the same for backward elimination. For each coefficient, it takes $m$ cycles to perform a division. Hence it needs at most $2(d-1) + m(d-1)$ and $2(d-1) + m(d-1)$ for $\beta$ and $L$, respectively. The latencies of finding the minimal linearized polynomials are determined by the number of registers, which is $2t$ to accommodate $\lambda_U(x)$, $\sigma_D(x)$, and $\sigma_U(x)$, whose degrees are $\mu$, $\delta$, and $\mu$, respectively. The $2t$ syndromes can be computed by $2t$ sets of multiply-and-accumulators in $n$ cycles. Note that the computations of $S(x)$, $\lambda_U(x)$, and $\sigma_D(x)$ can be done concurrently. The latency of RiBMA is $2t$ for $2t$ iterations. The latency of a symbolic product $a(x) \otimes b(x)$ is determined by the $q$-degree of $a(x)$. When computing $S_{DU}(x)$, we are concerned about only the terms of $q$-degree less than $d - 1$ because only those are meaningful for the key equation. For computing $S_{FD}(x)$, the result of $\sigma_D(x) \otimes S(x)$ in $S_{DU}(x)$ can be reused, so it needs only one symbolic product. In total, assuming $n = m$, the decoding latencies of our Gabidulin and KK decoders are $n(n+3)/2 + (n+5)t$ and $n(N+n+4)/2 + (4n+24)t$ cycles, respectively.

One assumption in our analysis is that the unit that computes $x^{q-1}$ in Figs. 9 and 11 is implemented with pure combinational logic, which leads to a long CPD for large $q$'s. For large $q$'s, to achieve a short CPD, it is necessary to pipeline the unit that computes $x^{q-1}$. There are two ways to pipeline it: $x^{q-1} = x \cdot x^2 \cdots x^{q/2}$, which requires $\log_2 q - 1$ multiplications, or $x^{q-1} = x^q / x$, which requires $m$ multiplications for division. To maintain a short CPD, $x^{q-1}$ needs to be implemented sequentially with one clock cycle for each multiplication. Let $c_{qm} = \min\{\log_2 q - 1, m\}$; then it requires at most $2(c_{qm} + 2)t$ clock cycles to obtain the minimal linearized polynomials $\lambda_U(x)$, $\sigma_D(x)$, and $\sigma_U(x)$. Similarly, it requires at most $c_{qm}(d-1)$ more cycles to perform forward elimination in Gabidulin's algorithm for the error locator, and the latency of this step will be $(m + c_{qm} + 2)(d-1)$ cycles.
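The first pipelining option can be sketched in Python as a chain of squarings and multiplications. The sketch assumes $q = 16$ and $m = 2$, so that $\mathbb{F}_{q^m} = \mathbb{F}_{2^8}$, represented with the field polynomial $x^8 + x^4 + x^3 + x + 1$; both the parameters and the function names are illustrative assumptions. Squarings are not counted as multiplications, in line with the earlier remark that powers of $q$ require only cyclic shifting.

```python
import math

def gf_mul(a, b, m=8, poly=0x11B):
    """Multiply in GF(2^8); the field polynomial is an assumed choice."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= poly
    return r

def pow_q_minus_1(x, q):
    """Compute x^(q-1) = x * x^2 * x^4 * ... * x^(q/2) for q a power of two."""
    acc, term, mults = x, x, 0
    step = 2
    while step < q:
        term = gf_mul(term, term)  # squaring, not counted as a multiplication
        acc = gf_mul(acc, term)    # one general multiplication per stage
        mults += 1
        step *= 2
    return acc, mults

def c_qm(q, m):
    """Cycles per x^(q-1) when the cheaper of the two options is used."""
    return min(int(math.log2(q)) - 1, m)

def pipelined_minpoly_latency(q, m, t):
    """Upper bound 2 (c_qm + 2) t from the text."""
    return 2 * (c_qm(q, m) + 2) * t

value, mult_count = pow_q_minus_1(0x53, 16)
```

For $q = 16$ the chain uses three multiplications, matching $\log_2 q - 1$, while the division-based option would need $m = 2$, so $c_{qm} = 2$ under these assumed parameters.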

In our architectures, we use a block-level pipeline scheme for high throughput. Data transfers between modules are buffered into multiple stages so the throughput is determined by only the longest latency of a single module. For brevity, we present only the data flow of our pipelined Gabidulin decoder in Fig. 12. The data in different pipeline stages are for different decoding sessions. Hence these five units can work on five different sessions concurrently for higher throughput. If some block finishes before others, it cannot start another session until all are finished. So the throughput of our block-level pipelined decoders is determined by the block with the longest latency. For both Gabidulin and KK decoders, the block that finds the root space is the bottleneck, requiring the longest latency in the worst-case scenario. Both decoders have the same bottleneck latency despite different areas.
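The closed-form latencies and the bottleneck argument can be checked numerically. The Python sketch below evaluates the two latency expressions and the root-space bottleneck $m(m+1)/2$, and estimates throughput as the number of received bits per block divided by the bottleneck time. The throughput formula and the block sizes ($nm$ bits for Gabidulin, $N(n+m)$ bits for KK) are assumptions of this sketch; they are not stated in the text, although they reproduce the figures reported in Tables IV and V.

```python
def gabidulin_latency(n, t):
    """Worst-case cycles n(n+3)/2 + (n+5)t, assuming n = m."""
    return n * (n + 3) // 2 + (n + 5) * t

def kk_latency(n, N, t):
    """Worst-case cycles n(N+n+4)/2 + (4n+24)t, assuming n = m."""
    return n * (N + n + 4) // 2 + (4 * n + 24) * t

def bottleneck_cycles(m):
    """Root-space search, the slowest pipeline block."""
    return m * (m + 1) // 2

def throughput_mbps(block_bits, cycles, cpd_ns):
    """Assumed model: one block leaves the pipeline every `cycles` clock periods."""
    return block_bits / (cycles * cpd_ns) * 1e3

# (8, 4) codes over GF(2^8), t = 2, with the CPD values reported in Table IV.
gab_8 = throughput_mbps(8 * 8, bottleneck_cycles(8), 3.572)
kk_8 = throughput_mbps(8 * (8 + 8), bottleneck_cycles(8), 3.696)
# (16, 8) codes over GF(2^16), t = 4, with the CPD values reported in Table V.
gab_16 = throughput_mbps(16 * 16, bottleneck_cycles(16), 4.427)
kk_16 = throughput_mbps(16 * (16 + 16), bottleneck_cycles(16), 4.651)
```

Under this model the latency of a single block does not affect throughput; only the 36-cycle or 136-cycle root-space stage does, which is why the Gabidulin and KK decoders differ in throughput only through their block sizes and CPDs.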



which can correct errors of rank up to two and four, respectively. We also implement our decoder architecture for their corresponding KK codes, which can correct $\epsilon$ errors, $\mu$ erasures, and $\delta$ deviations as long as $2\epsilon + \mu + \delta$ is no more than five or nine, respectively. Our designs are synthesized using Cadence RTL Compiler 7.1 and the MOSIS SCMOS TSMC 0.18 μm standard cell library [42]. The synthesis results are given in Tables IV and V. The total area in Tables IV and V includes both cell area and estimated net area, and the total power in Tables IV and V includes both leakage and estimated dynamic power. All estimations are made by the synthesis tool. To provide a reference for comparison, the gate count of our (8, 4) KK decoder is only 63% of that of the (255, 239) RS decoder over the same field in [43], which is 115,500. So for Gabidulin and KK codes over small fields, which have limited error-correcting capabilities, their hardware implementations are feasible. The area and power of the decoder architectures in Tables IV and V appear affordable except for applications with very stringent area and power requirements.

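TABLE IV
Synthesis results of decoders for (8, 4) Gabidulin and KK codes over $\mathbb{F}_{2^8}$

| | Gabidulin | KK |
|---|---|---|
| Gates | 19420 | 72527 |
| Area (mm²), cell | 0.466 | 1.741 |
| Area (mm²), net | 0.171 | 0.625 |
| Area (mm²), total | 0.637 | 2.366 |
| CPD (ns) | 3.572 | 3.696 |
| Estimated power (mW), leakage | 0.001 | 0.003 |
| Estimated power (mW), dynamic | 93.872 | 350.288 |
| Estimated power (mW), total | 93.873 | 350.291 |
| Latency (cycles) | 70 | 192 |
| Bottleneck (cycles) | 36 | 36 |
| Throughput (Mbit/s) | 498 | 962 |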



V. Implementation Results and Discussions 

To evaluate the performance of our decoder architectures, we implement our architectures for Gabidulin and KK codes for RLNC over $\mathbb{F}_2$. Note that although the random linear combinations are carried out over $\mathbb{F}_2$, decoding of Gabidulin and KK codes is performed over extension fields of $\mathbb{F}_2$.

Due to the hardware limitation in Fig. 5, we need to restrict $N$. In our implementation of KK decoders, we assume $N$ is no more than $m$. When $N > m$, the number of deviations $N - n$ is at least $N - m$ since $n \le m$. If $N$ is greater than $m$, we simply choose $m$ linearly independent packets randomly. In some cases, this may lose useful information of deviations and hence lead to higher error rates. In other cases, the extra deviations are not introduced into the received codeword and they reduce error correcting capability. So choosing $N = m$ is a trade-off that avoids false deviations at the cost of possibly useful deviations. If $N$ is smaller than $m$, it implies $E'$ is not linearly independent. But since Algorithm 4 can handle linearly dependent roots, the whole decoder works the same.

A. Implementation Results 

We implement our decoder architecture in Verilog for an (8, 4) Gabidulin code over $\mathbb{F}_{2^8}$ and a (16, 8) one over $\mathbb{F}_{2^{16}}$,



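TABLE V
Synthesis results of decoders for (16, 8) Gabidulin and KK codes over $\mathbb{F}_{2^{16}}$

| | Gabidulin | KK |
|---|---|---|
| Gates | 129218 | 437362 |
| Area (mm²), cell | 3.101 | 10.497 |
| Area (mm²), net | 1.452 | 4.752 |
| Area (mm²), total | 4.552 | 15.249 |
| CPD (ns) | 4.427 | 4.651 |
| Estimated power (mW), leakage | 0.007 | 0.024 |
| Estimated power (mW), dynamic | 738.171 | 2725.035 |
| Estimated power (mW), total | 738.178 | 2725.059 |
| Latency (cycles) | 236 | 640 |
| Bottleneck (cycles) | 136 | 136 |
| Throughput (Mbit/s) | 425 | 809 |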



B. Implementation Results of Long Codes 

Although the area and power shown in Tables IV and V are affordable and high throughputs are achieved, the Gabidulin and KK codes have very limited block lengths of 8 and 16. For practical network applications, the packet size may be large [22]. One approach to increase the block length of a constant-dimension code is to lift a Cartesian product of Gabidulin codes [40]. We also consider the hardware implementations for this case. We assume a packet size of 512 bytes and use a KK code that is based on a Cartesian product of 511 length-8 Gabidulin codes. As observed in Section III-E, the $n$-RRE form allows us to decode this long KK code in a serial, partly parallel, or fully parallel fashion. For example, more decoder modules can be used to decode in parallel for higher throughput. We list the gate counts and throughput of the serial and factor-7 parallel schemes based on the (8, 4) KK decoder in Table VI, and those of the serial and factor-5 parallel schemes based on the (16, 8) KK decoder in Table VII.

Fig. 12. Data flow of our pipelined Gabidulin decoder
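The latencies of the serial and parallel schemes can be related to the single-decoder figures with a short Python sketch. It assumes that each packet carries $n$ header bits followed by length-$m$ component blocks, and that after the first block every further round of component decodings adds one bottleneck interval; this model and the function names are assumptions, chosen because they are consistent with the latencies reported in Tables VI and VII.

```python
import math

def component_codes(packet_bits, n, m):
    """Number of length-m blocks that fit after the n-bit identity part."""
    return (packet_bits - n) // m

def long_code_latency(first_latency, bottleneck, blocks, parallel):
    """Assumed model: first block pays the full latency, later rounds pipeline."""
    rounds = math.ceil(blocks / parallel)
    return first_latency + (rounds - 1) * bottleneck

PACKET_BITS = 512 * 8

blocks_8 = component_codes(PACKET_BITS, 8, 8)      # (8, 4) components
blocks_16 = component_codes(PACKET_BITS, 16, 16)   # (16, 8) components

serial_8 = long_code_latency(192, 36, blocks_8, 1)
parallel_8 = long_code_latency(192, 36, blocks_8, 7)
serial_16 = long_code_latency(640, 136, blocks_16, 1)
parallel_16 = long_code_latency(640, 136, blocks_16, 5)
```

With these assumptions the 512-byte packet splits into 511 components for the (8, 4) code, as stated in the text, and the factor-7 and factor-5 schemes divide the components evenly into 73 and 51 rounds, respectively.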



TABLE VI
Performance of (8, 4) KK decoders for 512-byte packets

| | Serial | 7-Parallel |
|---|---|---|
| Gates | 72527 | 507689 |
| Area (mm²) | 2.366 | 16.562 |
| Estimated power (mW) | 350.291 | 2452.037 |
| Latency (cycles) | 18552 | 2784 |
| Throughput (Mbit/s) | 482 | 3374 |



TABLE VII
Performance of (16, 8) KK decoders for 512-byte packets

| | Serial | 5-Parallel |
|---|---|---|
| Gates | 437362 | 2186810 |
| Area (mm²) | 15.249 | 76.245 |
| Estimated power (mW) | 2725.035 | 13625.175 |
| Latency (cycles) | 35184 | 7440 |
| Throughput (Mbit/s) | 406 | 2030 |



C. Discussions 

Our implementation results above show that the hardware 
implementations of RLNC over small fields and with limited 
error control are quite feasible, unless there are very stringent 
area and power requirements. However, small field sizes imply 
limited block length and limited error control. As shown above, 
the block length of a constant-dimension code can be increased 
by lifting a Cartesian product of Gabidulin codes. While this 
easily provides arbitrarily long block length, it does not ad- 
dress the limited error control associated with small field sizes. 



For example, a Cartesian product of (8, 4) Gabidulin codes has 
the same error correction capability as the (8, 4) KK decoder, 
and their corresponding constant-dimension codes also have 
the same error correction capability. If we want to increase 
the error correction capabilities of both Gabidulin and KK 
codes, longer codes are needed and in turn larger fields are 
required. A larger field size implies a higher complexity for 
finite field arithmetic, and longer codes with greater error 
correction capability also lead to higher complexity. It remains 
to be seen whether the decoder architectures continue to be 
affordable for longer codes over larger fields, and this will be 
the subject of our future work. 

VI. Conclusion 

This paper presents novel hardware architectures for Gabidulin
and KK decoders. We not only reduce the algorithmic
complexity of the decoders but also fit them into regular
architectures suitable for circuit implementation. Synthesis
results using a standard cell library confirm that our designs
achieve high speed and high throughput.

Appendix A
Proof of Lemma 1

Proof: This follows the proof of [40, Proposition 7] closely.
Let the RRE form and an n-RRE form of $Y$ be given by

$$\mathrm{RRE}(Y) = \begin{bmatrix} W & \tilde{r} \\ 0 & E \end{bmatrix}, \qquad Y' = \begin{bmatrix} W' & \tilde{r}' \\ 0 & E' \end{bmatrix}.$$

Since the RRE form of $A$ is unique, $W = W'$. Thus, $\mu = \mu'$
and $\delta = \delta'$. In the proof of [40, Proposition 7], $\mathcal{U}$ is chosen
based on $W$. Thus, we choose $\mathcal{U}' = \mathcal{U}$. Since $L$ is uniquely
determined by $W$ and $L'$ is by $W'$, we also have $L = L'$.
Finally, choosing $r' = I_{\mathcal{U}'^c} \tilde{r}'$, we can show that the claim holds by
following the same steps as in the proof of [40, Proposition 7].

Appendix B
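Proof of Lemma 2

Proof: This follows a similar approach as in [9, Appendix C].
Suppose the received matrix has an n-RRE form

$$Y' = \begin{bmatrix} I + L' I_{\mathcal{U}'}^T & r' \\ 0 & E' \end{bmatrix}.$$

We have

$$
\begin{aligned}
\operatorname{rank}\begin{bmatrix} X \\ Y \end{bmatrix}
&= \operatorname{rank}\begin{bmatrix} I & x \\ I + L' I_{\mathcal{U}'}^T & r' \\ 0 & E' \end{bmatrix}
 = \operatorname{rank}\begin{bmatrix} I & x \\ L' I_{\mathcal{U}'}^T & r' - x \\ 0 & E' \end{bmatrix} \\
&= \operatorname{rank}\begin{bmatrix} I_{\mathcal{U}'^c}^T & I_{\mathcal{U}'^c}^T x \\ I_{\mathcal{U}'}^T & I_{\mathcal{U}'}^T x \\ L' I_{\mathcal{U}'}^T & r' - x \\ 0 & E' \end{bmatrix}
 = \operatorname{rank}\begin{bmatrix} I_{\mathcal{U}'^c}^T & I_{\mathcal{U}'^c}^T x \\ L' I_{\mathcal{U}'}^T & r' - x \\ 0 & E' \end{bmatrix} \qquad (11) \\
&= \operatorname{rank}\begin{bmatrix} I_{\mathcal{U}'^c}^T & 0 \\ L' I_{\mathcal{U}'}^T & r' - x \\ 0 & E' \end{bmatrix} \qquad (12) \\
&= \operatorname{rank}\begin{bmatrix} L' & r' - x \\ 0 & E' \end{bmatrix} + n - \mu'
\end{aligned}
$$

where (11) follows from $I_{\mathcal{U}'}^T [\, I + L' I_{\mathcal{U}'}^T \mid r' \,] = 0$ and (12)
follows from $I_{\mathcal{U}'}^T I_{\mathcal{U}'^c} = 0$.

Since $\operatorname{rank} X + \operatorname{rank} Y = 2n - \mu' + \delta'$, the subspace distance
is given by

$$d_S(\langle X \rangle, \langle Y \rangle) = 2 \operatorname{rank}\begin{bmatrix} X \\ Y \end{bmatrix} - \operatorname{rank} X - \operatorname{rank} Y = 2 \operatorname{rank}\begin{bmatrix} L' & r' - x \\ 0 & E' \end{bmatrix} - \mu' - \delta'. \qquad \blacksquare$$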
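For binary matrices, the subspace distance used in this proof, $d_S = 2\operatorname{rank}[X; Y] - \operatorname{rank} X - \operatorname{rank} Y$, can be evaluated directly with Gaussian elimination over $\mathbb{F}_2$. The following is a minimal sketch under the assumption that each matrix row is packed into a Python integer; the helper names `gf2_rank` and `subspace_distance` are ours.

```python
def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are packed into integers."""
    basis = []  # reduced vectors, each with its own distinct leading bit
    for row in rows:
        for b in basis:
            # row ^ b is smaller than row exactly when row has b's leading bit set,
            # so this clears that bit whenever it is present.
            row = min(row, row ^ b)
        if row:
            basis.append(row)
    return len(basis)


def subspace_distance(u_rows, v_rows):
    """d_S(<U>, <V>) = 2 rank[U; V] - rank U - rank V for row spaces over GF(2)."""
    stacked = list(u_rows) + list(v_rows)
    return 2 * gf2_rank(stacked) - gf2_rank(u_rows) - gf2_rank(v_rows)


# Toy example in GF(2)^3: <U> = span{100, 010}, <V> = span{100, 001}.
U = [0b100, 0b010]
V = [0b100, 0b001]
print(subspace_distance(U, V))  # 2 * 3 - 2 - 2 = 2
```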



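Appendix C
Decoding Example

Let $q = 2$, $N = n = m = 8$, and $k = 4$. The field $\mathbb{F}_{2^8}$ is
generated by $g(x) = x^8 + x^7 + x^5 + x^3 + 1$. Suppose we use
a Gabidulin code whose parity-check matrix is given by

$$H = \begin{bmatrix}
2 & 4 & 16 & 169 & 24 & 233 & 205 & 130 \\
4 & 16 & 169 & 24 & 233 & 205 & 130 & 2 \\
16 & 169 & 24 & 233 & 205 & 130 & 2 & 4 \\
169 & 24 & 233 & 205 & 130 & 2 & 4 & 16
\end{bmatrix}.$$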
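The cyclic structure of $H$ can be reproduced in a few lines. This sketch assumes that field elements are written as integers whose bits are polynomial coefficients, and that the modulus is $x^8 + x^7 + x^5 + x^3 + 1$ (0x1A9), the value implied by $16 \cdot 16 = 169$ in the first row of $H$. The names `gf_mul` and `build_parity_check` are illustrative.

```python
MOD = 0x1A9  # assumed modulus x^8 + x^7 + x^5 + x^3 + 1, inferred from 16 * 16 = 169


def gf_mul(a, b):
    """Multiply two elements of GF(2^8) by shift-and-add, reducing by MOD."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:      # degree reached 8: reduce
            a ^= MOD
        b >>= 1
    return result


def build_parity_check(h, rows):
    """Row i holds the entries of h raised to the power 2^i."""
    matrix = [list(h)]
    for _ in range(rows - 1):
        matrix.append([gf_mul(v, v) for v in matrix[-1]])
    return matrix


h = [2, 4, 16, 169, 24, 233, 205, 130]   # first row of H
H = build_parity_check(h, 4)
for row in H:
    print(row)
```

Because the first row consists of successive squares of the element 2, squaring every entry amounts to a cyclic shift of the row, which is why each row of $H$ is a rotation of the previous one.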



It is easy to verify that $x = (36, 28, 200, 56, 228, 208, 5, 98)^T$
is a codeword. We expand its entries over $\mathbb{F}_2$ starting from the
least significant bit. Hence the sent lifting subspace codeword
is given by $X = [\, I \mid x \,]$, with $x$ expanded into an $8 \times 8$ binary matrix.

Suppose the received matrix has an n-RRE form $Y'$, which can be expanded into

$$Y' = \begin{bmatrix} I + L' I_{\mathcal{U}'}^T & r' \\ 0 & E' \end{bmatrix}$$

such that the erasures are given by $L' = ((0, 0, 0, 0, 0, 0, 1, 0)^T, (0, 0, 0, 0, 0, 0, 0, 1)^T)$, the received vector is

$$r' = (219, 227, 55, 199, 27, 47, 0, 0),$$

and the deviations are $E' = (254, 255)$.

We know $\mu = \delta = 2$ and $\operatorname{rank}\begin{bmatrix} L' & r' - x \\ 0 & E' \end{bmatrix} = 4$, which is
smaller than the minimum distance $d = n - k + 1 = 5$, so it
is decodable.
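The rank condition can be checked numerically from the values quoted above. In this sketch each row of the block matrix with blocks $L'$, $r' - x$, $0$, $E'$ is packed into one integer (two erasure bits above the eight bits of the field element), and subtraction in characteristic 2 is XOR. The helper name `gf2_rank` and the packing are our own choices.

```python
def gf2_rank(rows):
    """Rank over GF(2) of a matrix whose rows are packed into integers."""
    basis = []
    for row in rows:
        for b in basis:
            row = min(row, row ^ b)   # clear b's leading bit from row if set
        if row:
            basis.append(row)
    return len(basis)


x = [36, 28, 200, 56, 228, 208, 5, 98]
r_prime = [219, 227, 55, 199, 27, 47, 0, 0]
E_prime = [254, 255]
# Rows of L' as 2-bit masks: only positions 7 and 8 carry an erasure.
L_rows = [0, 0, 0, 0, 0, 0, 0b01, 0b10]

diff = [a ^ b for a, b in zip(r_prime, x)]             # r' - x
rows = [(l << 8) | d for l, d in zip(L_rows, diff)]    # rows of [L' | r' - x]
rows += E_prime                                        # rows of [0 | E']

rank = gf2_rank(rows)
mu = delta = 2
print("rank =", rank)
print("subspace distance =", 2 * rank - mu - delta)
```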

The decoding process is as follows: 

1) Computing the auxiliary syndrome polynomial

a) syndromes $S = Hr' = (185, 169, 45, 130)$

b) erasures $X = L'^T h = (205, 130)$

c) erasure polynomial $\lambda_U(x) = \operatorname{minpoly}(X) = x^{[2]} + 146x^{[1]} + 49x^{[0]}$

d) deviation polynomial $\sigma_D(x) = \operatorname{minpoly}(E') = x^{[2]} + 142x^{[1]} + 143x^{[0]}$

e) $\bar{\lambda}_U(x) = 49x^{[2]} + 201x^{[1]} + x^{[0]}$ and $S_{DU}(x) = \sigma_D(x) \otimes S(x) \otimes \bar{\lambda}_U(x) = 142x^{[7]} + 198x^{[6]} + 141x^{[5]} + 44x^{[4]} + 124x^{[3]} + 251x^{[2]} + 53x^{[1]} + 241x^{[0]}$.

2) Computing the error span polynomial

a) solving $\sigma_F(x) \otimes S_{DU}(x) \equiv \omega(x) \bmod x^{[d-1]}$, in which $\deg \omega(x) < \tau$ and $\mu + \delta \le \tau \le d - 1$, gives $\sigma_F(x) = x^{[0]}$

b) $S_{FD}(x) = \sigma_F(x) \otimes \sigma_D(x) \otimes S(x) = 4x^{[5]} + 225x^{[4]} + 140x^{[3]} + 115x^{[2]} + 216x^{[1]} + 241x^{[0]}$;

c) solving $S_{FD,l} = \sum_{j=0}^{\mu-1} X_j^{[l]} \beta_j$ for $l = 2, 3$ gives $\beta = (48, 186)$

d) $\sigma_U(x) = \operatorname{minpoly}(\beta) = x^{[2]} + 150x^{[1]} + 69x^{[0]}$

e) $\sigma(x) = \sigma_U(x) \otimes \sigma_F(x) \otimes \sigma_D(x) = x^{[4]} + 238x^{[3]} + 187x^{[2]} + 141x^{[1]} + 217x^{[0]}$.

3) Finding the roots of the error span polynomial $\sigma(x)$: $E = (254, 157, 4, 251)$

4) Finding the error locations

a) solving $S_l = \sum_{j=0}^{\tau-1} X_j^{[l]} E_j$, $l = 0, 1, \ldots, d-2$, with $X_i$ known from the erasures for $i = 0, 1, \ldots, \mu - 1$ and where the rank of the error word $\tau \le d - 1$, gives $X = (205, 130, 204, 1)$

b) the error locations $L = (64, 128, 191, 255)$, such that $X = L^T h$

c) the error word is given by $e = \sum_{j=0}^{\tau-1} L_j E_j = (255, 255, 255, 255, 255, 255, 5, 98)$

d) the decoding output is $\hat{x} = x = (36, 28, 200, 56, 228, 208, 5, 98)^T$.
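Several of the intermediate values in this example can be recomputed with a short script. The sketch below is not the decoder architecture of this paper; it only re-derives the syndromes, the two minimal linearized polynomials, the auxiliary syndrome polynomial, and the error word. It rests on the following assumptions: field elements are integers with bits as polynomial coefficients, the modulus is $x^8 + x^7 + x^5 + x^3 + 1$ (0x1A9, consistent with $16 \cdot 16 = 169$ in $H$), vectors are expanded least significant bit first, and the partial q-reverse is taken as $\bar{f}_i = f_{t-i}^{[i-t]}$ for a polynomial of q-degree $t$. All function names are ours, and coefficient lists run from $x^{[0]}$ upward.

```python
MOD = 0x1A9  # assumed modulus x^8 + x^7 + x^5 + x^3 + 1


def gf_mul(a, b):
    """Multiply in GF(2^8) by shift-and-add, reducing by MOD."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= MOD
        b >>= 1
    return result


def frob(a, i):
    """a^[i] = a^(2^i), by repeated squaring."""
    for _ in range(i):
        a = gf_mul(a, a)
    return a


def lin_eval(coeffs, a):
    """Evaluate sum_i coeffs[i] * a^[i]."""
    acc = 0
    for c in coeffs:
        acc ^= gf_mul(c, a)
        a = gf_mul(a, a)
    return acc


def minpoly(elems):
    """Monic linearized polynomial vanishing on elems
    (assumed linearly independent over GF(2))."""
    coeffs = [1]                                        # M(x) = x
    for beta in elems:
        v = lin_eval(coeffs, beta)
        squared = [0] + [gf_mul(c, c) for c in coeffs]  # M(x)^2
        scaled = [gf_mul(v, c) for c in coeffs] + [0]   # M(beta) * M(x)
        coeffs = [s ^ t for s, t in zip(squared, scaled)]
    return coeffs


def compose(a, b):
    """Symbolic product a(b(x)) of two linearized polynomials."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, frob(bj, i))
    return out


def q_reverse(f):
    """Assumed partial q-reverse: coefficient i is f[t - i]^[i - t], t = q-degree."""
    t = len(f) - 1
    return [frob(f[t - i], (i - t) % 8) for i in range(t + 1)]


h = [2, 4, 16, 169, 24, 233, 205, 130]
r_prime = [219, 227, 55, 199, 27, 47, 0, 0]

# Step 1a: S_l = sum_j h_j^[l] r'_j for l = 0..3.
S, row = [], list(h)
for _ in range(4):
    acc = 0
    for hj, rj in zip(row, r_prime):
        acc ^= gf_mul(hj, rj)
    S.append(acc)
    row = [gf_mul(v, v) for v in row]

lambda_U = minpoly([205, 130])                 # step 1c
sigma_D = minpoly([254, 255])                  # step 1d
lambda_bar = q_reverse(lambda_U)               # step 1e
S_DU = compose(sigma_D, compose(S, lambda_bar))

# Step 4c: e_p is the sum of E_j over all j whose mask L_j has bit p set.
L = [64, 128, 191, 255]
E = [254, 157, 4, 251]
e = []
for p in range(8):
    acc = 0
    for Lj, Ej in zip(L, E):
        if (Lj >> p) & 1:
            acc ^= Ej
    e.append(acc)

# Step 4d: subtracting the error word (XOR) from r' gives the decoder output.
x_hat = [rj ^ ej for rj, ej in zip(r_prime, e)]
print(S, lambda_U, sigma_D, lambda_bar, S_DU, e, x_hat, sep="\n")
```

The key equation of step 2a is not solved here; since $\sigma_F(x) = x^{[0]}$ in this example, it plays no role in the remaining values.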



